Speaker Management
-Dialog Editor
+| Type | Speaker | Text / Duration | Actions |
+| --- | --- | --- | --- |
Results
+Show Generation Log
+(Generation log will appear here)+
Concatenated Audio:
+Download Archive:
+(ZIP download link will appear here)
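
For reference, a minimal client-side sketch of how the Results area above might be driven: it posts a dialog to `POST /api/dialog/generate` and then fetches the concatenated audio from the `/generated_audio/` static mount. Field names follow `DialogRequest`/`DialogResponse` in `backend/app/models/dialog_models.py`, and the host/port match `backend/README.md`; the `requests` dependency and the speaker IDs are placeholders used here only for illustration.

```python
# Minimal sketch only: assumes the backend is running locally per backend/README.md
# and that the `requests` package is installed (it is not in backend/requirements.txt).
import requests

BASE_URL = "http://127.0.0.1:8000"

payload = {
    "output_base_name": "demo_dialog",
    "dialog_items": [
        # speaker_id values are placeholders; real IDs come from GET /api/speakers
        {"type": "speech", "speaker_id": "<speaker-uuid>", "text": "Hello there!"},
        {"type": "silence", "duration": 0.5},
        {"type": "speech", "speaker_id": "<speaker-uuid>", "text": "Welcome back."},
    ],
}

resp = requests.post(f"{BASE_URL}/api/dialog/generate", json=payload, timeout=600)
resp.raise_for_status()
result = resp.json()

print(result["log"])

# concatenated_audio_url / zip_archive_url are served by the StaticFiles mount
# at /generated_audio in backend/app/main.py.
if result.get("concatenated_audio_url"):
    audio = requests.get(BASE_URL + result["concatenated_audio_url"])
    audio.raise_for_status()
    with open("demo_dialog_concatenated.wav", "wb") as f:
        f.write(audio.content)
```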
+diff --git a/.note/current_focus.md b/.note/current_focus.md index 7a8bb46..c591690 100644 --- a/.note/current_focus.md +++ b/.note/current_focus.md @@ -1,20 +1,23 @@ -# Current Focus +# Chatterbox TTS Migration: Backend Development (FastAPI) -**Date:** 2025-06-05 +**Primary Goal:** Implement the FastAPI backend for TTS dialog generation. -**Primary Goal:** Initiate the migration of the Chatterbox TTS dialog generator from Gradio to a vanilla JavaScript frontend and FastAPI backend. +**Recent Accomplishments (Phase 1, Step 2 - Speaker Management):** -**Recent Accomplishments:** +- Created Pydantic models for speaker data (`speaker_models.py`). +- Implemented `SpeakerManagementService` (`speaker_service.py`) for CRUD operations on speakers (metadata in `speakers.yaml`, samples in `speaker_samples/`). +- Created FastAPI router (`routers/speakers.py`) with endpoints: `GET /api/speakers`, `POST /api/speakers`, `GET /api/speakers/{id}`, `DELETE /api/speakers/{id}`. +- Integrated speaker router into the main FastAPI app (`main.py`). +- Successfully tested all speaker API endpoints using `curl`. -- Set up the `.note/` Memory Bank directory and essential files. -- Reviewed `gradio_app.py` to understand existing dialog generation logic. -- Developed a detailed, phased plan for re-implementing the dialog generation functionality with FastAPI and Vanilla JS. This plan has been saved to `.note/detailed_migration_plan.md`. +**Current Task (Phase 1, Step 3 - TTS Core):** -**Current Task:** +- **Develop `TTSService` in `backend/app/services/tts_service.py`.** + - Focus on `ChatterboxTTS` model loading, inference, and critical memory management. + - Define methods for speech generation using speaker samples. + - Manage TTS parameters (exaggeration, cfg_weight, temperature). -- Awaiting your feedback on the detailed migration plan (see `.note/detailed_migration_plan.md`). +**Next Immediate Steps:** -**Next Steps (pending your approval of plan):** - -- Begin Phase 1: Backend API Development (FastAPI). - - Task 1.1: Project Setup (FastAPI project structure, `requirements.txt`). +1. Finalize and test the initial implementation of `TTSService`. +2. Proceed to Phase 1, Step 4: Dialog Processing - Implement `DialogProcessorService` including text splitting logic. diff --git a/.note/detailed_migration_plan.md b/.note/detailed_migration_plan.md index d4b4aba..95a03b7 100644 --- a/.note/detailed_migration_plan.md +++ b/.note/detailed_migration_plan.md @@ -4,91 +4,95 @@ This plan outlines the steps to re-implement the dialog generation features of t ## 1. Backend (FastAPI) Development -**Objective:** Create a robust API to handle TTS generation, speaker management, and file delivery. +### Objective -**Key Modules/Components:** +Create a robust API to handle TTS generation, speaker management, and file delivery. -* **API Endpoints:** - * `POST /api/dialog/generate`: - * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`. - * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`. - * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`). - * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`. - * `DELETE /api/speakers/{speaker_id}`: Removes a speaker. 
-* **Core Logic & Services:** - * `TTSService`: - * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup). - * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults). - * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept). - * `DialogProcessorService`: - * Orchestrates dialog generation using `TTSService`. - * Implements `split_text_at_sentence_boundaries` logic for long text inputs. - * Manages generation of individual audio segments. - * `AudioManipulationService`: - * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences. - * Creates ZIP archives of all generated audio files using `zipfile`. - * `SpeakerManagementService`: - * Manages `speakers.yaml` (or alternative storage) for speaker metadata. - * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`). -* **File Handling:** - * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage). +### Key Modules/Components -**Implementation Steps (Phase 1):** +* **API Endpoints:** + * `POST /api/dialog/generate`: + * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`. + * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`. + * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`). + * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`. + * `DELETE /api/speakers/{speaker_id}`: Removes a speaker. +* **Core Logic & Services:** + * `TTSService`: + * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup). + * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults). + * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept). + * `DialogProcessorService`: + * Orchestrates dialog generation using `TTSService`. + * Implements `split_text_at_sentence_boundaries` logic for long text inputs. + * Manages generation of individual audio segments. + * `AudioManipulationService`: + * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences. + * Creates ZIP archives of all generated audio files using `zipfile`. + * `SpeakerManagementService`: + * Manages `speakers.yaml` (or alternative storage) for speaker metadata. + * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`). +* **File Handling:** + * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage). -1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`). -2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints. -3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management. -4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting. -5. 
**Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping. -6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services. -7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings. -8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`. +### Implementation Steps (Phase 1) + +1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`). +2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints. +3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management. +4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting. +5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping. +6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services. +7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings. +8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`. ## 2. Frontend (Vanilla JavaScript) Development -**Objective:** Create an intuitive UI for dialog construction, speaker management, and interaction with the backend. +### Objective -**Key Modules/Components:** +Create an intuitive UI for dialog construction, speaker management, and interaction with the backend. -* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display. -* **CSS (`style.css`):** Styling for a clean and usable interface. -* **JavaScript (`app.js`, `api.js`, `ui.js`):** - * `api.js`: Functions for all backend API communications (`fetch`). - * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering. - * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data). +### Key Modules/Components -**Implementation Steps (Phase 2):** +* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display. +* **CSS (`style.css`):** Styling for a clean and usable interface. +* **JavaScript (`app.js`, `api.js`, `ui.js`): + * `api.js`: Functions for all backend API communications (`fetch`). + * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering. + * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data). -1. **Basic Layout:** Create `index.html` and `style.css`. -2. **API Client:** Develop `api.js` to interface with all backend endpoints. -3. **Speaker UI:** - * Fetch and display speakers using `ui.js` and `api.js`. - * Implement forms and logic for adding (with file upload) and removing speakers. -4. **Dialog Editor UI:** - * Dynamically add/remove/reorder dialog lines (speech/silence). - * Inputs for speaker selection (populated from API), text, and silence duration. - * Input for `output_base_name`. -5. **Interaction & Results:** - * "Generate Dialog" button to submit data via `api.js`. - * Display generation log, audio player for concatenated output, and download link for ZIP file. +### Implementation Steps (Phase 2) + +1. **Basic Layout:** Create `index.html` and `style.css`. +2. **API Client:** Develop `api.js` to interface with all backend endpoints. +3. **Speaker UI:** + * Fetch and display speakers using `ui.js` and `api.js`. 
+ * Implement forms and logic for adding (with file upload) and removing speakers. +4. **Dialog Editor UI:** + * Dynamically add/remove/reorder dialog lines (speech/silence). + * Inputs for speaker selection (populated from API), text, and silence duration. + * Input for `output_base_name`. +5. **Interaction & Results:** + * "Generate Dialog" button to submit data via `api.js`. + * Display generation log, audio player for concatenated output, and download link for ZIP file. ## 3. Integration & Testing (Phase 3) -1. **Full System Connection:** Ensure seamless frontend-backend communication. -2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions. -3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed. -4. **UX Refinement:** Iterate on UI/UX based on testing feedback. +1. **Full System Connection:** Ensure seamless frontend-backend communication. +2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions. +3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed. +4. **UX Refinement:** Iterate on UI/UX based on testing feedback. ## 4. Advanced Features & Deployment (Phase 4) -* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]) -* **Real-time Updates:** Consider WebSockets for live progress during generation. -* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets. +* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]) +* **Real-time Updates:** Consider WebSockets for live progress during generation. +* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets. ## Key Considerations from `gradio_app.py` Analysis -* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly. -* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated. -* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend. -* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API. -* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive. +* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly. +* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated. +* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend. +* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API. +* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive. 
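
As a concrete illustration of the "Dialog Parsing" consideration above, a hedged sketch of turning the Gradio-style script format (`Speaker: "Text"` / `Silence: duration`) into the `dialog_items` list expected by `POST /api/dialog/generate`. The function name, regexes, and name-to-ID lookup are illustrative assumptions; the actual frontend may build the list directly from its UI state rather than parse free text.

```python
import re

def parse_script(script: str, speaker_ids: dict) -> list:
    """Turn a 'Speaker: "Text"' / 'Silence: 0.5' script into dialog_items.

    speaker_ids maps display names used in the script to the speaker IDs
    returned by GET /api/speakers. Lines matching neither pattern are skipped.
    """
    items = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        silence = re.match(r"^Silence:\s*([0-9]*\.?[0-9]+)\s*$", line, re.IGNORECASE)
        if silence:
            items.append({"type": "silence", "duration": float(silence.group(1))})
            continue
        speech = re.match(r'^([^:]+):\s*"(.+)"\s*$', line)
        if speech and speech.group(1).strip() in speaker_ids:
            items.append({
                "type": "speech",
                "speaker_id": speaker_ids[speech.group(1).strip()],
                "text": speech.group(2),
            })
    return items

# Example (speaker IDs are placeholders):
# parse_script('Alice: "Hi there."\nSilence: 0.5\nBob: "Hello."',
#              {"Alice": "uuid-alice", "Bob": "uuid-bob"})
```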
diff --git a/.note/session_log.md b/.note/session_log.md index 1bfd8fa..f39b057 100644 --- a/.note/session_log.md +++ b/.note/session_log.md @@ -1,5 +1,25 @@ # Session Log +--- +**Session Start:** 2025-06-05 (Continued) + +**Goal:** Progress Phase 1 of Chatterbox TTS backend migration: Initial Project Setup. + +**Key Activities & Insights:** +- Created `backend/app/main.py` with a basic FastAPI application instance. +- Confirmed user has an existing `.venv` at the project root. +- Updated `backend/README.md` to reflect usage of the root `.venv` instead of a backend-specific one. + - Adjusted venv activation paths and command execution locations (project root). +- Installed backend dependencies from `backend/requirements.txt` into the root `.venv`. +- Successfully ran the basic FastAPI server using `uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000` from the project root. +- Verified the API is accessible. +- Confirmed all Memory Bank files are present. Reviewed `current_focus.md` and `session_log.md`. + +**Next Steps:** +- Update `current_focus.md` and `session_log.md`. +- Proceed to Phase 1, Step 2: Speaker Management. +--- + --- **Session Start:** 2025-06-05 diff --git a/babel.config.cjs b/babel.config.cjs new file mode 100644 index 0000000..af8ceb1 --- /dev/null +++ b/babel.config.cjs @@ -0,0 +1,13 @@ +// babel.config.cjs +module.exports = { + presets: [ + [ + '@babel/preset-env', + { + targets: { + node: 'current', // Target the current version of Node.js + }, + }, + ], + ], +}; diff --git a/backend/README.md b/backend/README.md new file mode 100644 index 0000000..03ab194 --- /dev/null +++ b/backend/README.md @@ -0,0 +1,34 @@ +# Chatterbox TTS Backend + +This directory contains the FastAPI backend for the Chatterbox TTS application. + +## Project Structure + +- `app/`: Contains the main FastAPI application code. + - `__init__.py`: Makes `app` a Python package. + - `main.py`: FastAPI application instance and core API endpoints. + - `services/`: Business logic for TTS, dialog processing, etc. + - `models/`: Pydantic models for API request/response. + - `utils/`: Utility functions. +- `requirements.txt`: Project dependencies for the backend. +- `README.md`: This file. + +## Setup & Running + +It is assumed you have a Python virtual environment at the project root (e.g., `.venv`). + +1. Navigate to the **project root** directory (e.g., `/Volumes/SAM2/CODE/chatterbox-test`). +2. Activate the existing Python virtual environment: + ```bash + source .venv/bin/activate # On macOS/Linux + # .\.venv\Scripts\activate # On Windows + ``` +3. Install dependencies (ensure your terminal is in the **project root**): + ```bash + pip install -r backend/requirements.txt + ``` +4. Run the development server (ensure your terminal is in the **project root**): + ```bash + uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000 + ``` +The API should then be accessible at `http://127.0.0.1:8000`. diff --git a/backend/app/__init__.py b/backend/app/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/backend/app/__init__.py @@ -0,0 +1 @@ + diff --git a/backend/app/config.py b/backend/app/config.py new file mode 100644 index 0000000..70cd037 --- /dev/null +++ b/backend/app/config.py @@ -0,0 +1,19 @@ +from pathlib import Path + +# Determine PROJECT_ROOT dynamically. +# If config.py is at /Volumes/SAM2/CODE/chatterbox-test/backend/app/config.py +# then PROJECT_ROOT (/Volumes/SAM2/CODE/chatterbox-test) is 2 levels up. 
+PROJECT_ROOT = Path(__file__).resolve().parents[2] + +# Speaker data paths +SPEAKER_DATA_BASE_DIR = PROJECT_ROOT / "speaker_data" +SPEAKER_SAMPLES_DIR = SPEAKER_DATA_BASE_DIR / "speaker_samples" +SPEAKERS_YAML_FILE = SPEAKER_DATA_BASE_DIR / "speakers.yaml" + +# TTS temporary output path (used by DialogProcessorService) +TTS_TEMP_OUTPUT_DIR = PROJECT_ROOT / "tts_temp_outputs" + +# Final dialog output path (used by Dialog router and served by main app) +# These are stored within the 'backend' directory to be easily servable. +DIALOG_OUTPUT_PARENT_DIR = PROJECT_ROOT / "backend" +DIALOG_GENERATED_DIR = DIALOG_OUTPUT_PARENT_DIR / "tts_generated_dialogs" diff --git a/backend/app/main.py b/backend/app/main.py new file mode 100644 index 0000000..2d7849b --- /dev/null +++ b/backend/app/main.py @@ -0,0 +1,43 @@ +from fastapi import FastAPI +from fastapi.staticfiles import StaticFiles +from fastapi.middleware.cors import CORSMiddleware +from pathlib import Path +from app.routers import speakers, dialog # Import the routers +from app import config + +app = FastAPI( + title="Chatterbox TTS API", + description="API for generating TTS dialogs using Chatterbox TTS.", + version="0.1.0", +) + +# CORS Middleware configuration +origins = [ + "http://localhost:8001", + "http://127.0.0.1:8001", + # Add other origins if needed, e.g., your deployed frontend URL +] + +app.add_middleware( + CORSMiddleware, + allow_origins=origins, + allow_credentials=True, + allow_methods=["*"], # Allows all methods + allow_headers=["*"], # Allows all headers +) + +# Include routers +app.include_router(speakers.router, prefix="/api/speakers", tags=["Speakers"]) +app.include_router(dialog.router, prefix="/api/dialog", tags=["Dialog Generation"]) + +@app.get("/") +async def read_root(): + return {"message": "Welcome to the Chatterbox TTS API!"} + +# Ensure the directory for serving generated audio exists +config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True) + +# Mount StaticFiles to serve generated dialogs +app.mount("/generated_audio", StaticFiles(directory=config.DIALOG_GENERATED_DIR), name="generated_audio") + +# Further endpoints for speakers, dialog generation, etc., will be added here. diff --git a/backend/app/models/__init__.py b/backend/app/models/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/backend/app/models/__init__.py @@ -0,0 +1 @@ + diff --git a/backend/app/models/dialog_models.py b/backend/app/models/dialog_models.py new file mode 100644 index 0000000..e198adc --- /dev/null +++ b/backend/app/models/dialog_models.py @@ -0,0 +1,43 @@ +from pydantic import BaseModel, Field, validator +from typing import List, Union, Literal, Optional + +class DialogItemBase(BaseModel): + type: str + +class SpeechItem(DialogItemBase): + type: Literal['speech'] = 'speech' + speaker_id: str = Field(..., description="ID of the speaker for this speech segment.") + text: str = Field(..., description="Text content to be synthesized.") + exaggeration: Optional[float] = Field(0.5, description="Controls the expressiveness of the speech. Higher values lead to more exaggerated speech. Default from Gradio.") + cfg_weight: Optional[float] = Field(0.5, description="Classifier-Free Guidance weight. Higher values make the speech more aligned with the prompt text and speaker characteristics. Default from Gradio.") + temperature: Optional[float] = Field(0.8, description="Controls randomness in generation. Lower values make speech more deterministic, higher values more varied. 
Default from Gradio.") + +class SilenceItem(DialogItemBase): + type: Literal['silence'] = 'silence' + duration: float = Field(..., gt=0, description="Duration of the silence in seconds.") + +class DialogRequest(BaseModel): + dialog_items: List[Union[SpeechItem, SilenceItem]] = Field(..., description="A list of speech and silence items.") + output_base_name: str = Field(..., description="Base name for the output files (e.g., 'my_dialog_v1'). Extensions will be added automatically.") + + @validator('dialog_items', pre=True, each_item=True) + def check_item_type(cls, item): + if not isinstance(item, dict): + raise ValueError("Each dialog item must be a dictionary.") + item_type = item.get('type') + if item_type == 'speech': + # Pydantic will handle further validation based on SpeechItem model + return item + elif item_type == 'silence': + # Pydantic will handle further validation based on SilenceItem model + return item + raise ValueError(f"Unknown dialog item type: {item_type}. Must be 'speech' or 'silence'.") + +class DialogResponse(BaseModel): + log: str = Field(description="Log of the dialog generation process.") + # For now, these URLs might be relative paths or placeholders. + # Actual serving strategy will determine the final URL format. + concatenated_audio_url: Optional[str] = Field(None, description="URL/path to the concatenated audio file.") + zip_archive_url: Optional[str] = Field(None, description="URL/path to the ZIP archive of all audio files.") + temp_dir_path: Optional[str] = Field(None, description="Path to the temporary directory holding generated files, for server-side reference.") + error_message: Optional[str] = Field(None, description="Error message if the process failed globally.") diff --git a/backend/app/models/speaker_models.py b/backend/app/models/speaker_models.py new file mode 100644 index 0000000..1283ed7 --- /dev/null +++ b/backend/app/models/speaker_models.py @@ -0,0 +1,20 @@ +from pydantic import BaseModel +from typing import Optional + +class SpeakerBase(BaseModel): + name: str + +class SpeakerCreate(SpeakerBase): + # For receiving speaker name, file will be handled separately by FastAPI's UploadFile + pass + +class Speaker(SpeakerBase): + id: str + sample_path: Optional[str] = None # Path to the speaker's audio sample + + class Config: + from_attributes = True # Replaces orm_mode = True in Pydantic v2 + +class SpeakerResponse(SpeakerBase): + id: str + message: Optional[str] = None diff --git a/backend/app/routers/__init__.py b/backend/app/routers/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/backend/app/routers/__init__.py @@ -0,0 +1 @@ + diff --git a/backend/app/routers/dialog.py b/backend/app/routers/dialog.py new file mode 100644 index 0000000..2661512 --- /dev/null +++ b/backend/app/routers/dialog.py @@ -0,0 +1,189 @@ +from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks +from pathlib import Path +import shutil + +from app.models.dialog_models import DialogRequest, DialogResponse +from app.services.tts_service import TTSService +from app.services.speaker_service import SpeakerManagementService +from app.services.dialog_processor_service import DialogProcessorService +from app.services.audio_manipulation_service import AudioManipulationService +from app import config + +router = APIRouter() + +# --- Dependency Injection for Services --- +# These can be more sophisticated with a proper DI container or FastAPI's Depends system if services had complex init. 
+# For now, direct instantiation or simple Depends is fine. + +def get_tts_service(): + # Consider making device configurable + return TTSService(device="mps") + +def get_speaker_management_service(): + return SpeakerManagementService() + +def get_dialog_processor_service( + tts_service: TTSService = Depends(get_tts_service), + speaker_service: SpeakerManagementService = Depends(get_speaker_management_service) +): + return DialogProcessorService(tts_service=tts_service, speaker_service=speaker_service) + +def get_audio_manipulation_service(): + return AudioManipulationService() + +# --- Helper function to manage TTS model loading/unloading --- +async def manage_tts_model_lifecycle(tts_service: TTSService, task_function, *args, **kwargs): + """Loads TTS model, executes task, then unloads model.""" + try: + print("API: Loading TTS model...") + tts_service.load_model() + return await task_function(*args, **kwargs) + except Exception as e: + # Log or handle specific exceptions if needed before re-raising + print(f"API: Error during TTS model lifecycle or task execution: {e}") + raise + finally: + print("API: Unloading TTS model...") + tts_service.unload_model() + +async def process_dialog_flow( + request: DialogRequest, + dialog_processor: DialogProcessorService, + audio_manipulator: AudioManipulationService, + background_tasks: BackgroundTasks +) -> DialogResponse: + """Core logic for processing the dialog request.""" + processing_log_entries = [] + concatenated_audio_file_path = None + zip_archive_file_path = None + final_temp_dir_path_str = None + + try: + # 1. Process dialog to generate segments + # The DialogProcessorService creates its own temp dir for segments + dialog_processing_result = await dialog_processor.process_dialog( + dialog_items=[item.model_dump() for item in request.dialog_items], + output_base_name=request.output_base_name + ) + processing_log_entries.append(dialog_processing_result['log']) + segment_details = dialog_processing_result['segment_files'] + temp_segment_dir = Path(dialog_processing_result['temp_dir']) + final_temp_dir_path_str = str(temp_segment_dir) + + # Filter out error segments for concatenation and zipping + valid_segment_paths_for_concat = [ + Path(s['path']) for s in segment_details + if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists() + ] + + # Create a list of dicts suitable for concatenation service (speech paths and silence durations) + items_for_concatenation = [] + for s_detail in segment_details: + if s_detail['type'] == 'speech' and s_detail.get('path') and Path(s_detail['path']).exists(): + items_for_concatenation.append({'type': 'speech', 'path': s_detail['path']}) + elif s_detail['type'] == 'silence' and 'duration' in s_detail: + items_for_concatenation.append({'type': 'silence', 'duration': s_detail['duration']}) + # Errors are already logged by DialogProcessor + + if not any(item['type'] == 'speech' for item in items_for_concatenation): + message = "No valid speech segments were generated. Cannot create concatenated audio or ZIP." + processing_log_entries.append(message) + return DialogResponse( + log="\n".join(processing_log_entries), + temp_dir_path=final_temp_dir_path_str, + error_message=message + ) + + # 2. 
Concatenate audio segments + config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True) + concat_filename = f"{request.output_base_name}_concatenated.wav" + concatenated_audio_file_path = config.DIALOG_GENERATED_DIR / concat_filename + + audio_manipulator.concatenate_audio_segments( + segment_results=items_for_concatenation, + output_concatenated_path=concatenated_audio_file_path + ) + processing_log_entries.append(f"Concatenated audio saved to: {concatenated_audio_file_path}") + + # 3. Create ZIP archive + zip_filename = f"{request.output_base_name}_dialog_output.zip" + zip_archive_path = config.DIALOG_GENERATED_DIR / zip_filename + + # Collect all valid generated speech segment files for zipping + individual_segment_paths = [ + Path(s['path']) for s in segment_details + if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists() + ] + + # concatenated_audio_file_path is already defined and checked for existence before this block + + audio_manipulator.create_zip_archive( + segment_file_paths=individual_segment_paths, + concatenated_audio_path=concatenated_audio_file_path, + output_zip_path=zip_archive_path + ) + processing_log_entries.append(f"ZIP archive created at: {zip_archive_path}") + + # Schedule cleanup of the temporary segment directory + # background_tasks.add_task(shutil.rmtree, temp_segment_dir, ignore_errors=True) + # processing_log_entries.append(f"Scheduled cleanup for temporary segment directory: {temp_segment_dir}") + # For now, let's not auto-delete, so user can inspect. Cleanup can be a separate endpoint/job. + processing_log_entries.append(f"Temporary segment directory for inspection: {temp_segment_dir}") + + return DialogResponse( + log="\n".join(processing_log_entries), + # URLs should be relative to a static serving path, e.g., /generated_audio/ + # For now, just returning the name, assuming they are in DIALOG_OUTPUT_DIR + concatenated_audio_url=f"/generated_audio/{concat_filename}", + zip_archive_url=f"/generated_audio/{zip_filename}", + temp_dir_path=final_temp_dir_path_str + ) + + except FileNotFoundError as e: + error_msg = f"File not found during dialog generation: {e}" + processing_log_entries.append(error_msg) + raise HTTPException(status_code=404, detail=error_msg) + except ValueError as e: + error_msg = f"Invalid value or configuration: {e}" + processing_log_entries.append(error_msg) + raise HTTPException(status_code=400, detail=error_msg) + except RuntimeError as e: + error_msg = f"Runtime error during dialog generation: {e}" + processing_log_entries.append(error_msg) + # This could be a 500 if it's an unexpected server error + raise HTTPException(status_code=500, detail=error_msg) + except Exception as e: + import traceback + error_msg = f"An unexpected error occurred: {e}\n{traceback.format_exc()}" + processing_log_entries.append(error_msg) + raise HTTPException(status_code=500, detail=error_msg) + finally: + # Ensure logs are captured even if an early exception occurs before full response construction + if not concatenated_audio_file_path and not zip_archive_file_path and processing_log_entries: + print("Dialog generation failed. 
Log: \n" + "\n".join(processing_log_entries)) + +@router.post("/generate", response_model=DialogResponse) +async def generate_dialog_endpoint( + request: DialogRequest, + background_tasks: BackgroundTasks, + tts_service: TTSService = Depends(get_tts_service), + dialog_processor: DialogProcessorService = Depends(get_dialog_processor_service), + audio_manipulator: AudioManipulationService = Depends(get_audio_manipulation_service) +): + """ + Generates a dialog from a list of speech and silence items. + - Processes text into manageable chunks. + - Generates speech for each chunk using the specified speaker. + - Inserts silences as requested. + - Concatenates all audio segments into a single file. + - Creates a ZIP archive of all individual segments and the concatenated file. + """ + # Wrap the core processing logic with model loading/unloading + return await manage_tts_model_lifecycle( + tts_service, + process_dialog_flow, + request=request, + dialog_processor=dialog_processor, + audio_manipulator=audio_manipulator, + background_tasks=background_tasks + ) diff --git a/backend/app/routers/speakers.py b/backend/app/routers/speakers.py new file mode 100644 index 0000000..c5cedfe --- /dev/null +++ b/backend/app/routers/speakers.py @@ -0,0 +1,81 @@ +from typing import List, Annotated +from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Form + +from app.models.speaker_models import Speaker, SpeakerResponse +from app.services.speaker_service import SpeakerManagementService + +router = APIRouter( + tags=["Speakers"], + responses={404: {"description": "Not found"}}, +) + +# Dependency to get the speaker service instance +# This could be more sophisticated with a proper DI system later +def get_speaker_service(): + return SpeakerManagementService() + +@router.get("/", response_model=List[Speaker]) +async def get_all_speakers( + service: Annotated[SpeakerManagementService, Depends(get_speaker_service)] +): + """ + Retrieve all available speakers. + """ + return service.get_speakers() + +@router.post("/", response_model=SpeakerResponse, status_code=201) +async def create_new_speaker( + name: Annotated[str, Form()], + audio_file: Annotated[UploadFile, File()], + service: Annotated[SpeakerManagementService, Depends(get_speaker_service)] +): + """ + Add a new speaker. + Requires speaker name (form data) and an audio sample file (file upload). + """ + if not audio_file.filename: + raise HTTPException(status_code=400, detail="No audio file provided.") + if not audio_file.content_type or not audio_file.content_type.startswith("audio/"): + raise HTTPException(status_code=400, detail="Invalid audio file type. Please upload a valid audio file (e.g., WAV, MP3).") + + try: + new_speaker = await service.add_speaker(name=name, audio_file=audio_file) + return SpeakerResponse( + id=new_speaker.id, + name=new_speaker.name, + message="Speaker added successfully." + ) + except HTTPException as e: + # Re-raise HTTPExceptions from the service (e.g., file save error) + raise e + except Exception as e: + # Catch-all for other unexpected errors + raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {str(e)}") + + +@router.get("/{speaker_id}", response_model=Speaker) +async def get_speaker_details( + speaker_id: str, + service: Annotated[SpeakerManagementService, Depends(get_speaker_service)] +): + """ + Get details for a specific speaker by ID. 
+ """ + speaker = service.get_speaker_by_id(speaker_id) + if not speaker: + raise HTTPException(status_code=404, detail="Speaker not found") + return speaker + +@router.delete("/{speaker_id}", response_model=dict) +async def remove_speaker( + speaker_id: str, + service: Annotated[SpeakerManagementService, Depends(get_speaker_service)] +): + """ + Delete a speaker by ID. + """ + deleted = service.delete_speaker(speaker_id) + if not deleted: + raise HTTPException(status_code=404, detail="Speaker not found or could not be deleted.") + return {"message": "Speaker deleted successfully"} + diff --git a/backend/app/services/__init__.py b/backend/app/services/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/backend/app/services/__init__.py @@ -0,0 +1 @@ + diff --git a/backend/app/services/audio_manipulation_service.py b/backend/app/services/audio_manipulation_service.py new file mode 100644 index 0000000..c483d15 --- /dev/null +++ b/backend/app/services/audio_manipulation_service.py @@ -0,0 +1,241 @@ +import torch +import torchaudio +from pathlib import Path +from typing import List, Dict, Union, Tuple +import zipfile + +# Define a common sample rate, e.g., from the TTS model. This should ideally be configurable or dynamically obtained. +# For now, let's assume the TTS model (ChatterboxTTS) outputs at a known sample rate. +# The ChatterboxTTS model.sr is 24000. +DEFAULT_SAMPLE_RATE = 24000 + +class AudioManipulationService: + def __init__(self, default_sample_rate: int = DEFAULT_SAMPLE_RATE): + self.sample_rate = default_sample_rate + + def _load_audio(self, file_path: Union[str, Path]) -> Tuple[torch.Tensor, int]: + """Loads an audio file and returns the waveform and sample rate.""" + try: + waveform, sr = torchaudio.load(file_path) + return waveform, sr + except Exception as e: + raise RuntimeError(f"Error loading audio file {file_path}: {e}") + + def _create_silence(self, duration_seconds: float) -> torch.Tensor: + """Creates a silent audio tensor of a given duration.""" + num_frames = int(duration_seconds * self.sample_rate) + return torch.zeros((1, num_frames)) # Mono silence + + def concatenate_audio_segments( + self, + segment_results: List[Dict], + output_concatenated_path: Path + ) -> Path: + """ + Concatenates audio segments and silences into a single audio file. + + Args: + segment_results: A list of dictionaries, where each dict represents an audio + segment or a silence. Expected format: + For speech: {'type': 'speech', 'path': 'path/to/audio.wav', ...} + For silence: {'type': 'silence', 'duration': 0.5, ...} + output_concatenated_path: The path to save the final concatenated audio file. + + Returns: + The path to the concatenated audio file. + """ + all_waveforms: List[torch.Tensor] = [] + current_sample_rate = self.sample_rate # Assume this initially, verify with first loaded audio + + for i, segment_info in enumerate(segment_results): + segment_type = segment_info.get("type") + + if segment_type == "speech": + audio_path_str = segment_info.get("path") + if not audio_path_str: + print(f"Warning: Speech segment {i} has no path. Skipping.") + continue + + audio_path = Path(audio_path_str) + if not audio_path.exists(): + print(f"Warning: Audio file {audio_path} for segment {i} not found. Skipping.") + continue + + try: + waveform, sr = self._load_audio(audio_path) + # Ensure consistent sample rate. Resample if necessary. + # For simplicity, this example assumes all inputs will match self.sample_rate + # or the first loaded audio's sample rate. 
A more robust implementation + # would resample if sr != current_sample_rate. + if i == 0 and not all_waveforms: # First audio segment sets the reference SR if not default + current_sample_rate = sr + if sr != self.sample_rate: + print(f"Warning: First audio segment SR ({sr} Hz) differs from service default SR ({self.sample_rate} Hz). Using segment SR.") + + if sr != current_sample_rate: + print(f"Warning: Sample rate mismatch for {audio_path} ({sr} Hz) vs expected ({current_sample_rate} Hz). Resampling...") + resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=current_sample_rate) + waveform = resampler(waveform) + + # Ensure mono. If stereo, take the mean or first channel. + if waveform.shape[0] > 1: + waveform = torch.mean(waveform, dim=0, keepdim=True) + + all_waveforms.append(waveform) + except Exception as e: + print(f"Error processing speech segment {audio_path}: {e}. Skipping.") + + elif segment_type == "silence": + duration = segment_info.get("duration") + if duration is None or not isinstance(duration, (int, float)) or duration < 0: + print(f"Warning: Silence segment {i} has invalid duration. Skipping.") + continue + silence_waveform = self._create_silence(float(duration)) + all_waveforms.append(silence_waveform) + + elif segment_type == "error": + # Errors are already logged by DialogProcessorService, just skip here. + print(f"Skipping segment {i} due to previous error: {segment_info.get('message')}") + continue + + else: + print(f"Warning: Unknown segment type '{segment_type}' at index {i}. Skipping.") + + if not all_waveforms: + raise ValueError("No valid audio segments or silences found to concatenate.") + + # Concatenate all waveforms + final_waveform = torch.cat(all_waveforms, dim=1) + + # Ensure output directory exists + output_concatenated_path.parent.mkdir(parents=True, exist_ok=True) + + # Save the concatenated audio + try: + torchaudio.save(str(output_concatenated_path), final_waveform, current_sample_rate) + print(f"Concatenated audio saved to: {output_concatenated_path}") + return output_concatenated_path + except Exception as e: + raise RuntimeError(f"Error saving concatenated audio to {output_concatenated_path}: {e}") + + def create_zip_archive( + self, + segment_file_paths: List[Path], + concatenated_audio_path: Path, + output_zip_path: Path + ) -> Path: + """ + Creates a ZIP archive containing individual audio segments and the concatenated audio file. + + Args: + segment_file_paths: A list of paths to the individual audio segment files. + concatenated_audio_path: Path to the final concatenated audio file. + output_zip_path: The path to save the output ZIP archive. + + Returns: + The path to the created ZIP archive. + """ + output_zip_path.parent.mkdir(parents=True, exist_ok=True) + + with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zf: + # Add concatenated audio + if concatenated_audio_path.exists(): + zf.write(concatenated_audio_path, arcname=concatenated_audio_path.name) + else: + print(f"Warning: Concatenated audio file {concatenated_audio_path} not found for zipping.") + + # Add individual segments + segments_dir_name = "segments" + for file_path in segment_file_paths: + if file_path.exists() and file_path.is_file(): + # Store segments in a subdirectory within the zip for organization + zf.write(file_path, arcname=Path(segments_dir_name) / file_path.name) + else: + print(f"Warning: Segment file {file_path} not found or is not a file. 
Skipping for zipping.") + + print(f"ZIP archive created at: {output_zip_path}") + return output_zip_path + +# Example Usage (Test Block) +if __name__ == "__main__": + import tempfile + import shutil + + # Create a temporary directory for test files + test_temp_dir = Path(tempfile.mkdtemp(prefix="audio_manip_test_")) + print(f"Created temporary test directory: {test_temp_dir}") + + # Instance of the service + audio_service = AudioManipulationService() + + # --- Test Data Setup --- + # Create dummy audio files (e.g., short silences with different names) + dummy_sr = audio_service.sample_rate + segment1_path = test_temp_dir / "segment1_speech.wav" + segment2_path = test_temp_dir / "segment2_speech.wav" + + torchaudio.save(str(segment1_path), audio_service._create_silence(1.0), dummy_sr) + # Create a dummy segment with a different sample rate to test resampling + dummy_sr_alt = 16000 + temp_waveform_alt_sr = torch.rand((1, int(0.5 * dummy_sr_alt))) # 0.5s at 16kHz + torchaudio.save(str(segment2_path), temp_waveform_alt_sr, dummy_sr_alt) + + segment_results_for_concat = [ + {"type": "speech", "path": str(segment1_path), "speaker_id": "spk1", "text_chunk": "Test 1"}, + {"type": "silence", "duration": 0.5}, + {"type": "speech", "path": str(segment2_path), "speaker_id": "spk2", "text_chunk": "Test 2 (alt SR)"}, + {"type": "error", "message": "Simulated error, should be skipped"}, + {"type": "speech", "path": "non_existent_segment.wav"}, # Test non-existent file + {"type": "silence", "duration": -0.2} # Test invalid duration + ] + + concatenated_output_path = test_temp_dir / "final_concatenated_audio.wav" + zip_output_path = test_temp_dir / "audio_archive.zip" + + all_segment_files_for_zip = [segment1_path, segment2_path] + + try: + # Test concatenation + print("\n--- Testing Concatenation ---") + actual_concat_path = audio_service.concatenate_audio_segments( + segment_results_for_concat, + concatenated_output_path + ) + print(f"Concatenation test successful. Output: {actual_concat_path}") + assert actual_concat_path.exists() + # Basic check: load concatenated and verify duration (approx) + concat_wav, concat_sr = audio_service._load_audio(actual_concat_path) + expected_duration = 1.0 + 0.5 + 0.5 # seg1 (1.0s) + silence (0.5s) + seg2 (0.5s) = 2.0s + actual_duration = concat_wav.shape[1] / concat_sr + print(f"Expected duration (approx): {expected_duration}s, Actual duration: {actual_duration:.2f}s") + assert abs(actual_duration - expected_duration) < 0.1 # Allow small deviation + + # Test Zipping + print("\n--- Testing Zipping ---") + actual_zip_path = audio_service.create_zip_archive( + all_segment_files_for_zip, + actual_concat_path, + zip_output_path + ) + print(f"Zipping test successful. 
Output: {actual_zip_path}") + assert actual_zip_path.exists() + # Verify zip contents (basic check) + segments_dir_name = "segments" # Define this for the assertion below + with zipfile.ZipFile(actual_zip_path, 'r') as zf_read: + zip_contents = zf_read.namelist() + print(f"ZIP contents: {zip_contents}") + assert Path(segments_dir_name) / segment1_path.name in [Path(p) for p in zip_contents] + assert Path(segments_dir_name) / segment2_path.name in [Path(p) for p in zip_contents] + assert concatenated_output_path.name in zip_contents + + print("\nAll AudioManipulationService tests passed!") + + except Exception as e: + import traceback + print(f"\nAn error occurred during AudioManipulationService tests:") + traceback.print_exc() + finally: + # Clean up temporary directory + # shutil.rmtree(test_temp_dir) + # print(f"Cleaned up temporary test directory: {test_temp_dir}") + print(f"Test files are in {test_temp_dir}. Please inspect and delete manually if needed.") diff --git a/backend/app/services/dialog_processor_service.py b/backend/app/services/dialog_processor_service.py new file mode 100644 index 0000000..050e5b6 --- /dev/null +++ b/backend/app/services/dialog_processor_service.py @@ -0,0 +1,265 @@ +from pathlib import Path +from typing import List, Dict, Any, Union +import re + +from .tts_service import TTSService +from .speaker_service import SpeakerManagementService +from app import config +# Potentially models for dialog structure if we define them +# from ..models.dialog_models import DialogItem # Example + +class DialogProcessorService: + def __init__(self, tts_service: TTSService, speaker_service: SpeakerManagementService): + self.tts_service = tts_service + self.speaker_service = speaker_service + # Base directory for storing individual audio segments during processing + self.temp_audio_dir = config.TTS_TEMP_OUTPUT_DIR + self.temp_audio_dir.mkdir(parents=True, exist_ok=True) + + def _split_text(self, text: str, max_length: int = 300) -> List[str]: + """ + Splits text into chunks suitable for TTS processing, attempting to respect sentence boundaries. + Similar to split_text_at_sentence_boundaries from the original Gradio app. + Max_length is approximate, as it tries to finish sentences. + """ + # Basic sentence splitting using common delimiters. More sophisticated NLP could be used. + # This regex tries to split by '.', '!', '?', '...', followed by space or end of string. + # It also handles cases where these delimiters might be followed by quotes or parentheses. 
+ sentences = re.split(r'(?<=[.!?\u2026])\s+|(?<=[.!?\u2026])(?=["\')\]\}\u201d\u2019])|(?<=[.!?\u2026])$', text.strip()) + sentences = [s.strip() for s in sentences if s and s.strip()] + + chunks = [] + current_chunk = "" + for sentence in sentences: + if not sentence: + continue + if not current_chunk: # First sentence for this chunk + current_chunk = sentence + elif len(current_chunk) + len(sentence) + 1 <= max_length: + current_chunk += " " + sentence + else: + chunks.append(current_chunk) + current_chunk = sentence + + if current_chunk: # Add the last chunk + chunks.append(current_chunk) + + # Further split any chunks that are still too long (e.g., a single very long sentence) + final_chunks = [] + for chunk in chunks: + if len(chunk) > max_length: + # Simple split by length if a sentence itself is too long + for i in range(0, len(chunk), max_length): + final_chunks.append(chunk[i:i+max_length]) + else: + final_chunks.append(chunk) + return final_chunks + + async def process_dialog(self, dialog_items: List[Dict[str, Any]], output_base_name: str) -> Dict[str, Any]: + """ + Processes a list of dialog items (speech or silence) to generate audio segments. + + Args: + dialog_items: A list of dictionaries, where each item has: + - 'type': 'speech' or 'silence' + - For 'speech': 'speaker_id': str, 'text': str + - For 'silence': 'duration': float (in seconds) + output_base_name: The base name for the output files. + + Returns: + A dictionary containing paths to generated segments and other processing info. + Example: { + "log": "Processing complete...", + "segment_files": [ + {"type": "speech", "path": "/path/to/segment1.wav", "speaker_id": "X", "text_chunk": "..."}, + {"type": "silence", "duration": 0.5}, + {"type": "speech", "path": "/path/to/segment2.wav", "speaker_id": "Y", "text_chunk": "..."} + ], + "temp_dir": str(self.temp_audio_dir / output_base_name) + } + """ + segment_results = [] + processing_log = [] + + # Create a unique subdirectory for this dialog's temporary files + dialog_temp_dir = self.temp_audio_dir / output_base_name + dialog_temp_dir.mkdir(parents=True, exist_ok=True) + processing_log.append(f"Created temporary directory for segments: {dialog_temp_dir}") + + segment_idx = 0 + for i, item in enumerate(dialog_items): + item_type = item.get("type") + processing_log.append(f"Processing item {i+1}: type='{item_type}'") + + if item_type == "speech": + speaker_id = item.get("speaker_id") + text = item.get("text") + if not speaker_id or not text: + processing_log.append(f"Skipping speech item {i+1} due to missing speaker_id or text.") + segment_results.append({"type": "error", "message": "Missing speaker_id or text"}) + continue + + # Validate speaker_id and get speaker_sample_path + speaker_info = self.speaker_service.get_speaker_by_id(speaker_id) + if not speaker_info: + processing_log.append(f"Speaker ID '{speaker_id}' not found. Skipping item {i+1}.") + segment_results.append({"type": "error", "message": f"Speaker ID '{speaker_id}' not found"}) + continue + if not speaker_info.sample_path: + processing_log.append(f"Speaker ID '{speaker_id}' has no sample path defined. 
Skipping item {i+1}.") + segment_results.append({"type": "error", "message": f"Speaker ID '{speaker_id}' has no sample path defined"}) + continue + + # speaker_info.sample_path is relative to config.SPEAKER_DATA_BASE_DIR + abs_speaker_sample_path = config.SPEAKER_DATA_BASE_DIR / speaker_info.sample_path + if not abs_speaker_sample_path.is_file(): + processing_log.append(f"Speaker sample file not found or is not a file at '{abs_speaker_sample_path}' for speaker ID '{speaker_id}'. Skipping item {i+1}.") + segment_results.append({"type": "error", "message": f"Speaker sample not a file or not found: {abs_speaker_sample_path}"}) + continue + + text_chunks = self._split_text(text) + processing_log.append(f"Split text for speaker '{speaker_id}' into {len(text_chunks)} chunk(s).") + + for chunk_idx, text_chunk in enumerate(text_chunks): + segment_filename_base = f"{output_base_name}_seg{segment_idx}_spk{speaker_id}_chunk{chunk_idx}" + processing_log.append(f"Generating speech for chunk: '{text_chunk[:50]}...' using speaker '{speaker_id}'") + + try: + segment_output_path = await self.tts_service.generate_speech( + text=text_chunk, + speaker_id=speaker_id, # For metadata, actual sample path is used by TTS + speaker_sample_path=str(abs_speaker_sample_path), + output_filename_base=segment_filename_base, + output_dir=dialog_temp_dir, # Save to the dialog's temp dir + exaggeration=item.get('exaggeration', 0.5), # Default from Gradio, Pydantic model should provide this + cfg_weight=item.get('cfg_weight', 0.5), # Default from Gradio, Pydantic model should provide this + temperature=item.get('temperature', 0.8) # Default from Gradio, Pydantic model should provide this + ) + segment_results.append({ + "type": "speech", + "path": str(segment_output_path), + "speaker_id": speaker_id, + "text_chunk": text_chunk + }) + processing_log.append(f"Successfully generated segment: {segment_output_path}") + except Exception as e: + error_message = f"Error generating speech for chunk '{text_chunk[:50]}...': {repr(e)}" + processing_log.append(error_message) + segment_results.append({"type": "error", "message": error_message, "text_chunk": text_chunk}) + segment_idx += 1 + + elif item_type == "silence": + duration = item.get("duration") + if duration is None or duration < 0: + processing_log.append(f"Skipping silence item {i+1} due to invalid duration.") + segment_results.append({"type": "error", "message": "Invalid duration for silence"}) + continue + segment_results.append({"type": "silence", "duration": float(duration)}) + processing_log.append(f"Added silence of {duration}s.") + + else: + processing_log.append(f"Unknown item type '{item_type}' at item {i+1}. Skipping.") + segment_results.append({"type": "error", "message": f"Unknown item type: {item_type}"}) + + return { + "log": "\n".join(processing_log), + "segment_files": segment_results, + "temp_dir": str(dialog_temp_dir) # For cleanup or zipping later + } + +if __name__ == "__main__": + import asyncio + import pprint + + async def main_test(): + # Initialize services + tts_service = TTSService(device="mps") # or your preferred device + speaker_service = SpeakerManagementService() + dialog_processor = DialogProcessorService(tts_service, speaker_service) + + # Ensure dummy speaker sample exists (TTSService test block usually creates this) + # For robustness, we can call the TTSService test logic or ensure it's run prior. + # Here, we assume dummy_speaker_test.wav is available as per previous steps. 
+ # If not, the 'test_speaker_for_dialog_proc' will fail file validation. + + # First, ensure the dummy speaker file is created by TTSService's own test logic + # This is a bit of a hack for testing; ideally, test assets are managed independently. + try: + print("Ensuring dummy speaker sample is created by running TTSService's main_test logic...") + from .tts_service import main_test as tts_main_test + await tts_main_test() # This will create the dummy_speaker_test.wav + print("TTSService main_test completed, dummy sample should exist.") + except ImportError: + print("Could not import tts_service.main_test directly. Ensure dummy_speaker_test.wav exists.") + except Exception as e: + print(f"Error running tts_service.main_test for dummy sample creation: {e}") + print("Proceeding, but 'test_speaker_for_dialog_proc' might fail if sample is missing.") + + sample_dialog_items = [ + { + "type": "speech", + "speaker_id": "test_speaker_for_dialog_proc", # Defined in speakers.yaml + "text": "Hello world! This is the first speech segment." + }, + { + "type": "silence", + "duration": 0.75 + }, + { + "type": "speech", + "speaker_id": "test_speaker_for_dialog_proc", + "text": "This is a much longer piece of text that should definitely be split into multiple, smaller chunks by the dialog processor. It contains several sentences. Let's see how it handles this. The maximum length is set to 300 characters, but it tries to respect sentence boundaries. This sentence itself is quite long and might even be split mid-sentence if it exceeds the hard limit after sentence splitting. We will observe the output carefully to ensure it works as expected, creating multiple audio files for this single text block if necessary." + }, + { + "type": "speech", + "speaker_id": "non_existent_speaker_id", + "text": "This should fail because the speaker does not exist." + }, + { + "type": "invalid_type", + "text": "This item has an invalid type." + }, + { + "type": "speech", + "speaker_id": "test_speaker_for_dialog_proc", + "text": None # Test missing text + }, + { + "type": "speech", + "speaker_id": None, # Test missing speaker_id + "text": "This is a test with a missing speaker ID." + }, + { + "type": "silence", + "duration": -0.5 # Invalid duration + } + ] + + output_base_name = "dialog_processor_test_run" + + try: + print(f"\nLoading TTS model for DialogProcessorService test...") + # TTSService's generate_speech will load the model if not already loaded. + # However, explicit load/unload is good practice for a test block. 
+ tts_service.load_model() + + print(f"\nProcessing dialog items with base name: {output_base_name}...") + results = await dialog_processor.process_dialog(sample_dialog_items, output_base_name) + + print("\n--- Processing Log ---") + print(results.get("log")) + print("\n--- Segment Files / Results ---") + pprint.pprint(results.get("segment_files")) + print(f"\nTemporary directory used: {results.get('temp_dir')}") + print("\nPlease check the temporary directory for generated audio segments.") + + except Exception as e: + import traceback + print(f"\nAn error occurred during the DialogProcessorService test:") + traceback.print_exc() + finally: + print("\nUnloading TTS model...") + tts_service.unload_model() + print("DialogProcessorService test finished.") + + asyncio.run(main_test()) diff --git a/backend/app/services/speaker_service.py b/backend/app/services/speaker_service.py new file mode 100644 index 0000000..b72dc6a --- /dev/null +++ b/backend/app/services/speaker_service.py @@ -0,0 +1,147 @@ +import yaml +import uuid +import os +import io # Added for BytesIO +import torchaudio # Added for audio processing +from pathlib import Path +from typing import List, Dict, Optional, Any + +from fastapi import UploadFile, HTTPException +from app.models.speaker_models import Speaker, SpeakerCreate +from app import config + +class SpeakerManagementService: + def __init__(self): + self._ensure_data_files_exist() + self.speakers_data = self._load_speakers_data() + + def _ensure_data_files_exist(self): + """Ensures the speaker data directory and YAML file exist.""" + config.SPEAKER_DATA_BASE_DIR.mkdir(parents=True, exist_ok=True) + config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True) + if not config.SPEAKERS_YAML_FILE.exists(): + with open(config.SPEAKERS_YAML_FILE, 'w') as f: + yaml.dump({}, f) # Initialize with an empty dict, as per previous fixes + + def _load_speakers_data(self) -> Dict[str, Any]: # Changed return type to Dict + """Loads speaker data from the YAML file.""" + try: + with open(config.SPEAKERS_YAML_FILE, 'r') as f: + data = yaml.safe_load(f) + return data if isinstance(data, dict) else {} # Ensure it's a dict + except FileNotFoundError: + return {} + except yaml.YAMLError: + # Handle corrupted YAML file, e.g., log error and return empty list + print(f"Error: Corrupted speakers YAML file at {config.SPEAKERS_YAML_FILE}") + return {} + + + def _save_speakers_data(self): + """Saves the current speaker data to the YAML file.""" + with open(config.SPEAKERS_YAML_FILE, 'w') as f: + yaml.dump(self.speakers_data, f, sort_keys=False) + + def get_speakers(self) -> List[Speaker]: + """Returns a list of all speakers.""" + # self.speakers_data is now a dict: {speaker_id: {name: ..., sample_path: ...}} + return [Speaker(id=spk_id, **spk_attrs) for spk_id, spk_attrs in self.speakers_data.items()] + + def get_speaker_by_id(self, speaker_id: str) -> Optional[Speaker]: + """Retrieves a speaker by their ID.""" + if speaker_id in self.speakers_data: + speaker_attributes = self.speakers_data[speaker_id] + return Speaker(id=speaker_id, **speaker_attributes) + return None + + async def add_speaker(self, name: str, audio_file: UploadFile) -> Speaker: + """Adds a new speaker, converts sample to WAV, saves it, and updates YAML.""" + speaker_id = str(uuid.uuid4()) + + # Define standardized sample filename and path (always WAV) + sample_filename = f"{speaker_id}.wav" + sample_path = config.SPEAKER_SAMPLES_DIR / sample_filename + + try: + content = await audio_file.read() + # Use BytesIO to handle the 
in-memory audio data for torchaudio
+            audio_buffer = io.BytesIO(content)
+
+            # Load audio data using torchaudio, this handles various formats (MP3, WAV, etc.)
+            # waveform is a tensor, sample_rate is an int
+            waveform, sample_rate = torchaudio.load(audio_buffer)
+
+            # Save the audio data as WAV
+            # Ensure the SPEAKER_SAMPLES_DIR exists (though _ensure_data_files_exist should handle it)
+            config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
+            torchaudio.save(str(sample_path), waveform, sample_rate, format="wav")
+
+        except (RuntimeError, ValueError) as e:
+            # torchaudio does not expose a public TorchaudioException class; decode/format
+            # problems (unsupported format, corrupted file) surface as RuntimeError/ValueError.
+            raise HTTPException(status_code=400, detail=f"Error processing audio file: {e}. Ensure it's a valid audio format (e.g., WAV, MP3).")
+        except Exception as e:
+            # General error handling for other issues (e.g., file system errors)
+            raise HTTPException(status_code=500, detail=f"Could not save audio file: {e}")
+        finally:
+            await audio_file.close()
+
+        new_speaker_data = {
+            "id": speaker_id,
+            "name": name,
+            "sample_path": str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR))  # Store path relative to speaker_data dir
+        }
+
+        # self.speakers_data is now a dict
+        self.speakers_data[speaker_id] = {
+            "name": name,
+            "sample_path": str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR))
+        }
+        self._save_speakers_data()
+        # Construct Speaker model for return, including the ID
+        return Speaker(id=speaker_id, name=name, sample_path=str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR)))
+
+    def delete_speaker(self, speaker_id: str) -> bool:
+        """Deletes a speaker and their audio sample."""
+        # Speaker data is now a dictionary, keyed by speaker_id
+        speaker_to_delete = self.speakers_data.pop(speaker_id, None)
+
+        if speaker_to_delete:
+            self._save_speakers_data()
+            sample_path_str = speaker_to_delete.get("sample_path")
+            if sample_path_str:
+                # sample_path_str is relative to SPEAKER_DATA_BASE_DIR
+                full_sample_path = config.SPEAKER_DATA_BASE_DIR / sample_path_str
+                try:
+                    if full_sample_path.is_file():  # Check if it's a file before removing
+                        os.remove(full_sample_path)
+                except OSError as e:
+                    # Log error if file deletion fails but proceed
+                    print(f"Error deleting sample file {full_sample_path}: {e}")
+            return True
+        return False
+
+# Example usage (for testing, not part of the service itself)
+if __name__ == "__main__":
+    service = SpeakerManagementService()
+    print("Initial speakers:", service.get_speakers())
+
+    # This part would require a mock UploadFile to run directly
+    # print("\nAdding a new speaker (manual test setup needed for UploadFile)")
+    # class MockUploadFile:
+    #     def __init__(self, filename, content):
+    #         self.filename = filename
+    #         self._content = content
+    #     async def read(self): return self._content
+    #     async def close(self): pass
+    # import asyncio
+    # async def test_add():
+    #     mock_file = MockUploadFile("test.wav", b"dummy audio content")
+    #     new_speaker = await service.add_speaker(name="Test Speaker", audio_file=mock_file)
+    #     print("\nAdded speaker:", new_speaker)
+    #     print("Speakers after add:", service.get_speakers())
+    #     return new_speaker.id
+    # speaker_id_to_delete = asyncio.run(test_add())
+    # if speaker_id_to_delete:
+    #     print(f"\nDeleting speaker {speaker_id_to_delete}")
+    #     service.delete_speaker(speaker_id_to_delete)
+    #     print("Speakers after delete:", service.get_speakers())
diff --git a/backend/app/services/tts_service.py b/backend/app/services/tts_service.py new file mode 100644 index
0000000..266dd1c --- /dev/null +++ b/backend/app/services/tts_service.py @@ -0,0 +1,155 @@ +import torch +import torchaudio +from typing import Optional +from chatterbox.tts import ChatterboxTTS +from pathlib import Path +import gc # Garbage collector for memory management + +# Define a directory for TTS model outputs, could be temporary or configurable +TTS_OUTPUT_DIR = Path("/Volumes/SAM2/CODE/chatterbox-test/tts_outputs") # Example path + +class TTSService: + def __init__(self, device: str = "mps"): # Default to MPS for Macs, can be "cpu" or "cuda" + self.device = device + self.model = None + self._ensure_output_dir_exists() + + def _ensure_output_dir_exists(self): + """Ensures the TTS output directory exists.""" + TTS_OUTPUT_DIR.mkdir(parents=True, exist_ok=True) + + def load_model(self): + """Loads the ChatterboxTTS model.""" + if self.model is None: + print(f"Loading ChatterboxTTS model to device: {self.device}...") + try: + self.model = ChatterboxTTS.from_pretrained(device=self.device) + print("ChatterboxTTS model loaded successfully.") + except Exception as e: + print(f"Error loading ChatterboxTTS model: {e}") + # Potentially raise an exception or handle appropriately + raise + else: + print("ChatterboxTTS model already loaded.") + + def unload_model(self): + """Unloads the model and clears memory.""" + if self.model is not None: + print("Unloading ChatterboxTTS model and clearing cache...") + del self.model + self.model = None + if self.device == "cuda": + torch.cuda.empty_cache() + elif self.device == "mps": + if hasattr(torch.mps, "empty_cache"): # Check if empty_cache is available for MPS + torch.mps.empty_cache() + gc.collect() # Explicitly run garbage collection + print("Model unloaded and memory cleared.") + + async def generate_speech( + self, + text: str, + speaker_sample_path: str, # Absolute path to the speaker's audio sample + output_filename_base: str, # e.g., "dialog_line_1_spk_X_chunk_0" + speaker_id: Optional[str] = None, # Optional, mainly for logging if needed, filename base is primary + output_dir: Optional[Path] = None, # Optional, defaults to TTS_OUTPUT_DIR from this module + exaggeration: float = 0.5, # Default from Gradio + cfg_weight: float = 0.5, # Default from Gradio + temperature: float = 0.8, # Default from Gradio + ) -> Path: + """ + Generates speech from text using the loaded TTS model and a speaker sample. + Saves the output to a .wav file. + """ + if self.model is None: + self.load_model() + + if self.model is None: # Check again if loading failed + raise RuntimeError("TTS model is not loaded. 
Cannot generate speech.") + + # Ensure speaker_sample_path is valid + speaker_sample_p = Path(speaker_sample_path) + if not speaker_sample_p.exists() or not speaker_sample_p.is_file(): + raise FileNotFoundError(f"Speaker sample audio file not found: {speaker_sample_path}") + + target_output_dir = output_dir if output_dir is not None else TTS_OUTPUT_DIR + target_output_dir.mkdir(parents=True, exist_ok=True) + # output_filename_base from DialogProcessorService is expected to be comprehensive (e.g., includes speaker_id, segment info) + output_file_path = target_output_dir / f"{output_filename_base}.wav" + + print(f"Generating audio for text: \"{text[:50]}...\" with speaker sample: {speaker_sample_path}") + try: + with torch.no_grad(): # Important for inference + wav = self.model.generate( + text=text, + audio_prompt_path=str(speaker_sample_p), # Must be a string path + exaggeration=exaggeration, + cfg_weight=cfg_weight, + temperature=temperature, + ) + + torchaudio.save(str(output_file_path), wav, self.model.sr) + print(f"Audio saved to: {output_file_path}") + return output_file_path + except Exception as e: + print(f"Error during TTS generation or saving: {e}") + raise + finally: + # For now, we keep it loaded. Memory management might need refinement. + pass + +# Example usage (for testing, not part of the service itself) +if __name__ == "__main__": + async def main_test(): + tts_service = TTSService(device="mps") + try: + tts_service.load_model() + + dummy_speaker_root = Path("/Volumes/SAM2/CODE/chatterbox-test/speaker_data/speaker_samples") + dummy_speaker_root.mkdir(parents=True, exist_ok=True) + dummy_sample_file = dummy_speaker_root / "dummy_speaker_test.wav" + import os # Added for os.remove + # Always try to remove an existing dummy file to ensure a fresh one is created + if dummy_sample_file.exists(): + try: + os.remove(dummy_sample_file) + print(f"Removed existing dummy sample: {dummy_sample_file}") + except OSError as e: + print(f"Error removing existing dummy sample {dummy_sample_file}: {e}") + # Proceeding, but torchaudio.save might fail or overwrite + + print(f"Creating new dummy speaker sample: {dummy_sample_file}") + # Create a minimal, silent WAV file for testing + sample_rate = 22050 + duration = 1 # seconds + num_channels = 1 + num_frames = sample_rate * duration + audio_data = torch.zeros((num_channels, num_frames)) + try: + torchaudio.save(str(dummy_sample_file), audio_data, sample_rate) + print(f"Dummy sample created successfully: {dummy_sample_file}") + except Exception as save_e: + print(f"Could not create dummy sample: {save_e}") + # If creation fails, the subsequent generation test will likely also fail or be skipped. 
+ + + if dummy_sample_file.exists(): + output_path = await tts_service.generate_speech( + text="Hello, this is a test of the Text-to-Speech service.", + speaker_id="test_speaker", + speaker_sample_path=str(dummy_sample_file), + output_filename_base="test_generation" + ) + print(f"Test generation output: {output_path}") + else: + print(f"Skipping generation test as dummy sample {dummy_sample_file} not found.") + + except Exception as e: + import traceback + print(f"Error during TTS generation or saving:") + traceback.print_exc() + finally: + tts_service.unload_model() + + import asyncio + asyncio.run(main_test()) \ No newline at end of file diff --git a/backend/requirements.txt b/backend/requirements.txt new file mode 100644 index 0000000..082887f --- /dev/null +++ b/backend/requirements.txt @@ -0,0 +1,7 @@ +fastapi +uvicorn[standard] +python-multipart +PyYAML +torch +torchaudio +chatterbox-tts diff --git a/backend/run_api_test.py b/backend/run_api_test.py new file mode 100644 index 0000000..f993d65 --- /dev/null +++ b/backend/run_api_test.py @@ -0,0 +1,108 @@ +import requests +import json +from pathlib import Path +import time + +# Configuration +API_BASE_URL = "http://localhost:8000/api/dialog" +ENDPOINT_URL = f"{API_BASE_URL}/generate" + +# Define project root relative to this test script (assuming it's in backend/) +PROJECT_ROOT = Path(__file__).resolve().parent +GENERATED_DIALOGS_DIR = PROJECT_ROOT / "tts_generated_dialogs" + +DIALOG_PAYLOAD = { + "output_base_name": "test_dialog_from_script", + "dialog_items": [ + { + "type": "speech", + "speaker_id": "dummy_speaker", # Ensure this speaker exists in your speakers.yaml and has a sample .wav + "text": "This is a test from the Python script. One, two, three.", + "exaggeration": 1.5, + "cfg_weight": 4.0, + "temperature": 0.5 + }, + { + "type": "silence", + "duration": 0.5 + }, + { + "type": "speech", + "speaker_id": "dummy_speaker", + "text": "Testing complete. All systems nominal." + }, + { + "type": "speech", + "speaker_id": "non_existent_speaker", # Test case for invalid speaker + "text": "This should produce an error for this segment." 
+ }, + { + "type": "silence", + "duration": 0.25 # Changed to valid duration + } + ] +} + +def run_test(): + print(f"Sending POST request to: {ENDPOINT_URL}") + print("Payload:") + print(json.dumps(DIALOG_PAYLOAD, indent=2)) + print("-" * 50) + + try: + start_time = time.time() + response = requests.post(ENDPOINT_URL, json=DIALOG_PAYLOAD, timeout=120) # Increased timeout for TTS processing + end_time = time.time() + + print(f"Response received in {end_time - start_time:.2f} seconds.") + print(f"Status Code: {response.status_code}") + print("-" * 50) + + if response.content: + try: + response_data = response.json() + print("Response JSON:") + print(json.dumps(response_data, indent=2)) + print("-" * 50) + + if response.status_code == 200: + print("Test PASSED (HTTP 200 OK)") + concatenated_url = response_data.get("concatenated_audio_url") + zip_url = response_data.get("zip_archive_url") + temp_dir = response_data.get("temp_dir_path") + + if concatenated_url: + print(f"Concatenated audio URL: http://localhost:8000{concatenated_url}") + if zip_url: + print(f"ZIP archive URL: http://localhost:8000{zip_url}") + if temp_dir: + print(f"Temporary segment directory: {temp_dir}") + + print("\nTo verify, check the generated files in:") + print(f" Concatenated/ZIP: {GENERATED_DIALOGS_DIR}") + print(f" Individual segments (if not cleaned up): {temp_dir}") + else: + print(f"Test FAILED (HTTP {response.status_code})") + if response_data.get("detail"): + print(f"Error Detail: {response_data.get('detail')}") + + except json.JSONDecodeError: + print("Response content is not valid JSON:") + print(response.text) + print("Test FAILED (Invalid JSON Response)") + else: + print("Response content is empty.") + print(f"Test FAILED (Empty Response, HTTP {response.status_code})") + + except requests.exceptions.ConnectionError as e: + print(f"Connection Error: {e}") + print("Test FAILED (Could not connect to the server. Is it running?)") + except requests.exceptions.Timeout as e: + print(f"Request Timeout: {e}") + print("Test FAILED (The request timed out. 
TTS processing might be too slow or stuck.)") + except Exception as e: + print(f"An unexpected error occurred: {e}") + print("Test FAILED (Unexpected error)") + +if __name__ == "__main__": + run_test() diff --git a/frontend/css/style.css b/frontend/css/style.css index af95928..aa364a7 100644 --- a/frontend/css/style.css +++ b/frontend/css/style.css @@ -1,74 +1,255 @@ -/* Basic styles - to be expanded */ +/* Modern, clean, and accessible UI styles for Chatterbox TTS */ body { - font-family: sans-serif; - line-height: 1.6; + font-family: 'Segoe UI', 'Roboto', 'Arial', sans-serif; + line-height: 1.7; margin: 0; padding: 0; - background-color: #f4f4f4; - color: #333; + background-color: #f7f9fa; + color: #222; +} + +.container { + max-width: 1100px; + margin: 0 auto; + padding: 0 18px; } header { - background: #333; + background: #222e3a; color: #fff; - padding: 1rem 0; + padding: 1.5rem 0 1rem 0; text-align: center; + border-bottom: 3px solid #4a90e2; +} + +h1 { + font-size: 2.4rem; + margin: 0; + letter-spacing: 1px; } main { - padding: 20px; - max-width: 960px; - margin: auto; + margin-top: 30px; + margin-bottom: 30px; +} + +.panel-grid { + display: flex; + flex-wrap: wrap; + gap: 28px; + justify-content: space-between; +} + +.panel { + flex: 1 1 320px; + min-width: 320px; + background: none; + box-shadow: none; + border: none; + padding: 0; +} + + +#results-display.panel { + flex: 1 1 100%; + min-width: 0; + margin-top: 32px; +} + +/* Dialog Table Styles */ +#dialog-items-table { + width: 100%; + border-collapse: collapse; + background: #fff; + border-radius: 8px; + overflow: hidden; + font-size: 1rem; + margin-bottom: 0; +} +#dialog-items-table th, #dialog-items-table td { + padding: 10px 12px; + border-bottom: 1px solid #e3e3e3; + text-align: left; +} +#dialog-items-table th { + background: #f3f7fa; + color: #4a90e2; + font-weight: 600; + font-size: 1.05rem; +} +#dialog-items-table tr:last-child td { + border-bottom: none; +} +#dialog-items-table td.actions { + text-align: center; + min-width: 90px; +} + +/* Collapsible log details */ +details#generation-log-details { + margin-bottom: 0; + border-radius: 4px; + background: #f3f5f7; + box-shadow: 0 1px 3px rgba(44,62,80,0.04); + padding: 0 0 0 0; + transition: box-shadow 0.15s; +} +details#generation-log-details[open] { + box-shadow: 0 2px 8px rgba(44,62,80,0.07); + background: #f9fafb; +} +details#generation-log-details summary { + font-size: 1rem; + color: #357ab8; + padding: 10px 0 6px 0; + outline: none; +} +details#generation-log-details summary:focus { + outline: 2px solid #4a90e2; + border-radius: 3px; +} + +@media (max-width: 900px) { + .panel-grid { + display: block; + gap: 0; + } + .panel, .full-width-panel { + min-width: 0; + width: 100%; + flex: 1 1 100%; + } + #dialog-items-table th, #dialog-items-table td { + font-size: 0.97rem; + padding: 7px 8px; + } + #speaker-management.panel { + margin-bottom: 36px; + width: 100%; + max-width: 100%; + flex: 1 1 100%; + } +} + +.card { + background: #fff; + border-radius: 8px; + box-shadow: 0 2px 8px rgba(44,62,80,0.07); + padding: 18px 20px; + margin-bottom: 18px; } section { - background: #fff; - padding: 20px; - margin-bottom: 20px; - border-radius: 5px; + margin-bottom: 0; + border-radius: 0; + padding: 0; + background: none; } hr { - margin: 20px 0; - border: 0; - border-top: 1px solid #eee; + display: none; +} + +h2 { + font-size: 1.5rem; + margin-top: 0; + margin-bottom: 16px; + color: #4a90e2; + letter-spacing: 0.5px; +} + +h3 { + font-size: 1.1rem; + margin-bottom: 10px; + 
color: #333; +} + +.x-remove-btn { + background: #e74c3c; + color: #fff; + border: none; + border-radius: 50%; + width: 28px; + height: 28px; + font-size: 1.2rem; + line-height: 1; + display: inline-flex; + align-items: center; + justify-content: center; + cursor: pointer; + transition: background 0.15s; + margin: 0 2px; + box-shadow: 0 1px 2px rgba(44,62,80,0.06); + outline: none; + padding: 0; +} +.x-remove-btn:hover, .x-remove-btn:focus { + background: #c0392b; + color: #fff; + outline: 2px solid #e74c3c; +} + +.form-row { + display: flex; + align-items: center; + gap: 12px; + margin-bottom: 14px; +} + +label { + min-width: 120px; + font-weight: 500; + margin-bottom: 0; +} + +input[type='text'], input[type='file'] { + padding: 8px 10px; + border: 1px solid #cfd8dc; + border-radius: 4px; + font-size: 1rem; + width: 100%; + box-sizing: border-box; +} + +input[type='file'] { + background: #f7f7f7; + font-size: 0.97rem; } button { - padding: 10px 15px; - background: #333; + padding: 9px 18px; + background: #4a90e2; color: #fff; border: none; border-radius: 5px; cursor: pointer; - margin-right: 5px; /* Add some margin between buttons */ + font-size: 1rem; + font-weight: 500; + transition: background 0.15s; + margin-right: 10px; } -button:hover { - background: #555; +button:hover, button:focus { + background: #357ab8; + outline: none; } -input[type='text'], input[type='file'] { - padding: 8px; +.dialog-controls { margin-bottom: 10px; - border: 1px solid #ddd; - border-radius: 4px; - width: calc(100% - 20px); /* Adjust width considering padding */ -} - -label { - display: block; - margin-bottom: 5px; } #speaker-list { list-style: none; padding: 0; + margin: 0; } #speaker-list li { - padding: 5px 0; - border-bottom: 1px dotted #eee; + padding: 7px 0; + border-bottom: 1px solid #e3e3e3; + display: flex; + justify-content: space-between; + align-items: center; } #speaker-list li:last-child { @@ -76,17 +257,74 @@ label { } pre { - background: #eee; - padding: 10px; + background: #f3f5f7; + padding: 12px; border-radius: 4px; - white-space: pre-wrap; /* Allow wrapping */ - word-wrap: break-word; /* Break long words */ + font-size: 0.98rem; + white-space: pre-wrap; + word-wrap: break-word; + margin: 0; +} + +audio { + width: 100%; + margin-top: 8px; + margin-bottom: 8px; +} + +#zip-archive-link { + display: inline-block; + margin-right: 10px; + color: #fff; + background: #4a90e2; + padding: 7px 16px; + border-radius: 4px; + text-decoration: none; + font-weight: 500; + transition: background 0.15s; +} + +#zip-archive-link:hover, #zip-archive-link:focus { + background: #357ab8; } footer { text-align: center; - padding: 20px; - background: #333; + padding: 20px 0; + background: #222e3a; color: #fff; - margin-top: 30px; + margin-top: 40px; + font-size: 1rem; + border-top: 3px solid #4a90e2; +} + +@media (max-width: 900px) { + .panel-grid { + flex-direction: column; + gap: 22px; + } + .panel { + min-width: 0; + } +} + +/* Simple side-by-side layout for speaker management */ +.speaker-mgmt-row { + display: flex; + gap: 20px; +} + +.speaker-mgmt-row .card { + flex: 1; + width: 50%; +} + +/* Stack on mobile */ +@media (max-width: 768px) { + .speaker-mgmt-row { + flex-direction: column; + } + .speaker-mgmt-row .card { + width: 100%; + } } diff --git a/frontend/index.html b/frontend/index.html index 4b8472c..bb7fdb6 100644 --- a/frontend/index.html +++ b/frontend/index.html @@ -8,77 +8,92 @@
(index.html diff body — markup garbled in extraction; recoverable content: a dialog items table with Type / Speaker / Text / Duration / Actions columns, a generation-log placeholder "(Generation log will appear here)", and a ZIP download link placeholder "(ZIP download link will appear here)".)
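For quick end-to-end checks of the speaker endpoints (run_api_test.py above only exercises `/api/dialog/generate`), a small script along the following lines can be used. This is a minimal sketch, not part of the diff: it assumes the speaker router is mounted at `/api/speakers` on `localhost:8000`, and that the upload form fields are `name` and `audio_file` (matching the `SpeakerManagementService.add_speaker` signature) — adjust the field names and sample path to whatever `routers/speakers.py` actually declares.

```python
# speaker_api_smoke_test.py — illustrative only; mount point and field names are assumptions.
import requests

BASE_URL = "http://localhost:8000/api/speakers"  # assumed mount point of routers/speakers.py
SAMPLE_WAV = "speaker_data/speaker_samples/dummy_speaker_test.wav"  # any readable WAV/MP3

def main():
    # 1. List existing speakers.
    resp = requests.get(BASE_URL, timeout=10)
    resp.raise_for_status()
    print("Existing speakers:", resp.json())

    # 2. Add a speaker from a local audio sample (multipart/form-data upload).
    #    Field names are assumed; match them to the router's parameters.
    with open(SAMPLE_WAV, "rb") as f:
        resp = requests.post(
            BASE_URL,
            data={"name": "Smoke Test Speaker"},
            files={"audio_file": ("sample.wav", f, "audio/wav")},
            timeout=60,
        )
    resp.raise_for_status()
    speaker = resp.json()
    print("Created speaker:", speaker)

    # 3. Delete the speaker again so repeated runs stay clean.
    resp = requests.delete(f"{BASE_URL}/{speaker['id']}", timeout=10)
    resp.raise_for_status()
    print("Deleted speaker:", speaker["id"])

if __name__ == "__main__":
    main()
```

Like run_api_test.py, it only needs `requests` and a running `uvicorn` instance of the backend.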