Compare commits

...

6 Commits

36 changed files with 8425 additions and 0 deletions

.gitignore vendored

@@ -5,3 +5,9 @@ output*.wav
*.mp3
dialog_output/
*.zip
.DS_Store
__pycache__
projects/
# Node.js dependencies
node_modules/

.note/code_structure.md Normal file

@@ -0,0 +1,32 @@
# Code Structure
*(This document will describe the organization of the codebase as it evolves.)*
## Current (Gradio-based - to be migrated)
- `gradio_app.py`: Main application logic for the Gradio UI.
- `requirements.txt`: Python dependencies.
- `speaker_samples/`: Directory for speaker audio samples.
- `speakers.yaml`: Configuration for speakers.
- `single_output/`: Output directory for single utterance TTS.
- `dialog_output/`: Output directory for dialog TTS.
## Planned (FastAPI + Vanilla JS)
### Backend (FastAPI - Python)
- `main.py`: FastAPI application entry point, router setup.
- `api/`: Directory for API endpoint modules (e.g., `tts_routes.py`, `speaker_routes.py`).
- `core/`: Core logic (e.g., TTS processing, dialog assembly, file management).
- `models/`: Pydantic models for request/response validation.
- `services/`: Business logic services (e.g., `TTSService`, `DialogService`).
- `static/` (or served via CDN): For frontend files if not using a separate frontend server during development.
### Frontend (Vanilla JavaScript)
- `index.html`: Main HTML file.
- `css/`: Stylesheets.
- `style.css`
- `js/`: JavaScript files.
- `app.js`: Main application logic.
- `api.js`: Functions for interacting with the FastAPI backend.
- `uiComponents.js`: Reusable UI components (e.g., DialogLine, AudioPlayer).
- `state.js`: Frontend state management (if needed).
- `assets/`: Static assets like images or icons.

.note/current_focus.md Normal file

@@ -0,0 +1,23 @@
# Chatterbox TTS Migration: Backend Development (FastAPI)
**Primary Goal:** Implement the FastAPI backend for TTS dialog generation.
**Recent Accomplishments (Phase 1, Step 2 - Speaker Management):**
- Created Pydantic models for speaker data (`speaker_models.py`).
- Implemented `SpeakerManagementService` (`speaker_service.py`) for CRUD operations on speakers (metadata in `speakers.yaml`, samples in `speaker_samples/`).
- Created FastAPI router (`routers/speakers.py`) with endpoints: `GET /api/speakers`, `POST /api/speakers`, `GET /api/speakers/{id}`, `DELETE /api/speakers/{id}`.
- Integrated speaker router into the main FastAPI app (`main.py`).
- Successfully tested all speaker API endpoints using `curl`.
**Current Task (Phase 1, Step 3 - TTS Core):**
- **Develop `TTSService` in `backend/app/services/tts_service.py`.**
- Focus on `ChatterboxTTS` model loading, inference, and critical memory management.
- Define methods for speech generation using speaker samples.
- Manage TTS parameters (exaggeration, cfg_weight, temperature).
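
As a rough illustration of the intended shape (the actual `tts_service.py` is not part of this diff), a minimal sketch of `TTSService` follows. It assumes the `chatterbox.tts.ChatterboxTTS` API (`from_pretrained(device=...)`, `generate(text, audio_prompt_path=..., exaggeration=..., cfg_weight=..., temperature=...)`, and `model.sr`) as documented in the Chatterbox README, and mirrors the call signature that `DialogProcessorService` expects:

```python
# Illustrative sketch only -- not the actual implementation.
import gc
from pathlib import Path

import torch
import torchaudio
from chatterbox.tts import ChatterboxTTS  # assumed import path per the Chatterbox README


class TTSService:
    def __init__(self, device: str = "mps"):
        self.device = device
        self.model = None

    def load_model(self):
        """Load the ChatterboxTTS model once; subsequent calls are no-ops."""
        if self.model is None:
            self.model = ChatterboxTTS.from_pretrained(device=self.device)

    def unload_model(self):
        """Critical memory management: drop the model and free cached accelerator memory."""
        self.model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        if hasattr(torch, "mps") and torch.backends.mps.is_available():
            torch.mps.empty_cache()

    async def generate_speech(
        self,
        text: str,
        speaker_id: str,  # kept for logging/metadata parity with DialogProcessorService
        speaker_sample_path: str,
        output_filename_base: str,
        output_dir: Path,
        exaggeration: float = 0.5,
        cfg_weight: float = 0.5,
        temperature: float = 0.8,
    ) -> Path:
        """Generate one WAV segment for `text`, conditioned on the speaker's sample."""
        self.load_model()
        wav = self.model.generate(
            text,
            audio_prompt_path=speaker_sample_path,
            exaggeration=exaggeration,
            cfg_weight=cfg_weight,
            temperature=temperature,
        )
        output_path = Path(output_dir) / f"{output_filename_base}.wav"
        torchaudio.save(str(output_path), wav, self.model.sr)
        return output_path
```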
**Next Immediate Steps:**
1. Finalize and test the initial implementation of `TTSService`.
2. Proceed to Phase 1, Step 4: Dialog Processing - Implement `DialogProcessorService` including text splitting logic.

.note/decision_log.md Normal file

@@ -0,0 +1,22 @@
# Decision Log
This log records key decisions made throughout the project, along with their rationale.
---
**Date:** 2025-06-05
**Decision ID:** 20250605-001
**Decision:** Adopt the `.note/` Memory Bank system for project documentation and context management.
**Rationale:** As per the user's global development standards (MEMORY[user_global]), to ensure persistent knowledge and effective collaboration, especially given potential agent memory resets.
**Impact:** Creation of standard `.note/` files (`project_overview.md`, `current_focus.md`, etc.). All significant project information, decisions, and progress will be logged here.
---
**Date:** 2025-06-05
**Decision ID:** 20250605-002
**Decision:** Created a detailed migration plan for moving from Gradio to FastAPI & Vanilla JS.
**Rationale:** Based on a thorough review of `gradio_app.py` and the user's request, a detailed, phased plan was necessary to guide development. This incorporates key findings about TTS model management, text processing, and output requirements.
**Impact:** The plan is stored in `.note/detailed_migration_plan.md`. `current_focus.md` has been updated to reflect this. Development will follow this plan upon user approval.
**Related Memory:** MEMORY[b82cdf38-f0b9-45cd-8097-5b1b47030a40] (System memory of the plan)
---

.note/detailed_migration_plan.md Normal file

@@ -0,0 +1,98 @@
# Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan
This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).
## 1. Backend (FastAPI) Development
### Objective
Create a robust API to handle TTS generation, speaker management, and file delivery.
### Key Modules/Components
* **API Endpoints:**
* `POST /api/dialog/generate`:
* **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
* **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
* `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
* `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
* `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
* **Core Logic & Services:**
* `TTSService`:
* Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
* Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
* Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
* `DialogProcessorService`:
* Orchestrates dialog generation using `TTSService`.
* Implements `split_text_at_sentence_boundaries` logic for long text inputs.
* Manages generation of individual audio segments.
* `AudioManipulationService`:
* Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
* Creates ZIP archives of all generated audio files using `zipfile`.
* `SpeakerManagementService`:
* Manages `speakers.yaml` (or alternative storage) for speaker metadata.
* Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
* **File Handling:**
* Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).
### Implementation Steps (Phase 1)
1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting.
5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services.
7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
## 2. Frontend (Vanilla JavaScript) Development
### Objective
Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.
### Key Modules/Components
* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
* **CSS (`style.css`):** Styling for a clean and usable interface.
* **JavaScript (`app.js`, `api.js`, `ui.js`):**
* `api.js`: Functions for all backend API communications (`fetch`).
* `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
* `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).
### Implementation Steps (Phase 2)
1. **Basic Layout:** Create `index.html` and `style.css`.
2. **API Client:** Develop `api.js` to interface with all backend endpoints.
3. **Speaker UI:**
* Fetch and display speakers using `ui.js` and `api.js`.
* Implement forms and logic for adding (with file upload) and removing speakers.
4. **Dialog Editor UI:**
* Dynamically add/remove/reorder dialog lines (speech/silence).
* Inputs for speaker selection (populated from API), text, and silence duration.
* Input for `output_base_name`.
5. **Interaction & Results:**
* "Generate Dialog" button to submit data via `api.js`.
* Display generation log, audio player for concatenated output, and download link for ZIP file.
## 3. Integration & Testing (Phase 3)
1. **Full System Connection:** Ensure seamless frontend-backend communication.
2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
4. **UX Refinement:** Iterate on UI/UX based on testing feedback.
## 4. Advanced Features & Deployment (Phase 4)
* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
* **Real-time Updates:** Consider WebSockets for live progress during generation.
* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.
## Key Considerations from `gradio_app.py` Analysis
* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated.
* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend (see the parsing sketch after this list).
* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
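
For illustration only: a minimal sketch of parsing the legacy `Speaker: "Text"` / `Silence: duration` script format into the structured item list described above. The function name, the exact line patterns, and the direct use of the speaker label as `speaker_id` are assumptions (a real frontend would resolve display names to IDs via `GET /api/speakers`):

```python
import re
from typing import Any, Dict, List

# Minimal sketch; assumes one dialog item per line in the legacy Gradio script format.
SPEECH_RE = re.compile(r'^(?P<speaker>[^:]+):\s*"(?P<text>.*)"\s*$')
SILENCE_RE = re.compile(r'^Silence:\s*(?P<duration>\d+(\.\d+)?)\s*$', re.IGNORECASE)


def parse_dialog_script(script: str) -> List[Dict[str, Any]]:
    items: List[Dict[str, Any]] = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        silence = SILENCE_RE.match(line)
        if silence:
            items.append({"type": "silence", "duration": float(silence.group("duration"))})
            continue
        speech = SPEECH_RE.match(line)
        if speech:
            items.append({
                "type": "speech",
                "speaker_id": speech.group("speaker").strip(),
                "text": speech.group("text"),
            })
            continue
        raise ValueError(f"Unrecognized dialog line: {line!r}")
    return items


if __name__ == "__main__":
    example = 'Alice: "Hello there!"\nSilence: 0.5\nBob: "Hi Alice."'
    print(parse_dialog_script(example))
```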

.note/development_standards.md Normal file

@@ -0,0 +1,21 @@
# Development Standards
*(To be defined. This document will outline coding conventions, patterns, and best practices for the project.)*
## General Principles
- **Clarity and Readability:** Code should be easy to understand and maintain.
- **Modularity:** Design components with clear responsibilities and interfaces.
- **Testability:** Write code that is easily testable.
## Python (FastAPI Backend)
- Follow PEP 8 style guidelines.
- Use type hints.
- Structure API endpoints logically.
## JavaScript (Vanilla JS Frontend)
- Follow modern JavaScript best practices (ES6+).
- Organize code into modules.
- Prioritize performance and responsiveness.
## Commit Messages
- Follow conventional commit message format (e.g., `feat: add new TTS feature`, `fix: resolve audio playback bug`).

.note/interfaces.md Normal file

@@ -0,0 +1,88 @@
# Component Interfaces
*(This document will define the interfaces between different components of the system, especially between the frontend and backend.)*
## Backend API (FastAPI)
*(To be detailed. Examples below)*
### `/api/tts/generate_single` (POST)
- **Request Body:**
```json
{
"text": "string",
"speaker_id": "string",
"temperature": "float (optional)",
"length_penalty": "float (optional)"
}
```
- **Response Body (Success):**
```json
{
"audio_url": "string (URL to the generated audio file)",
"duration_ms": "integer"
}
```
- **Response Body (Error):**
```json
{
"detail": "string (error message)"
}
```
### `/api/tts/generate_dialog` (POST)
- **Request Body:**
```json
{
"dialog_lines": [
{
"type": "speech", // or "silence"
"speaker_id": "string (required if type is speech)",
"text": "string (required if type is speech)",
"duration_s": "float (required if type is silence)"
}
],
"output_base_name": "string (optional)"
}
```
- **Response Body (Success):**
```json
{
"dialog_audio_url": "string (URL to the concatenated dialog audio file)",
"individual_files_zip_url": "string (URL to zip of individual lines)",
"total_duration_ms": "integer"
}
```
### `/api/speakers` (GET)
- **Response Body (Success):**
```json
[
{
"id": "string",
"name": "string",
"sample_url": "string (optional)"
}
]
```
### `/api/speakers` (POST)
- **Request Body:** (Multipart form-data)
- `name`: "string"
- `audio_sample`: file (WAV)
- **Response Body (Success):**
```json
{
"id": "string",
"name": "string",
"message": "Speaker added successfully"
}
```
## Frontend Components (Vanilla JS)
*(To be detailed as frontend development progresses.)*
- **DialogLine Component:** Manages input for a single line of dialog (speaker, text).
- **AudioPlayer Component:** Handles playback of generated audio.
- **ProjectManager Component:** Manages overall project state, dialog lines, and interaction with the backend.

.note/project_overview.md Normal file

@@ -0,0 +1,42 @@
# Project Overview: Chatterbox TTS Application Migration
## 1. Current System
The project is currently a Gradio-based application named "Chatterbox TTS Gradio App".
Its primary function is to provide a user interface for text-to-speech (TTS) generation using the Chatterbox TTS model.
Key features of the current Gradio application include:
- Single utterance TTS generation.
- Multi-speaker dialog generation with configurable silence gaps.
- Speaker management (adding/removing speakers with custom audio samples).
- Automatic memory optimization (model cleanup after generation).
- Organized output file storage (`single_output/` and `dialog_output/`).
## 2. Project Goal: Migration to Modern Web Stack
The primary goal of this project is to re-implement the Chatterbox TTS application, specifically its dialog generation capabilities, by migrating from the current Gradio framework to a new architecture.
The new architecture will consist of:
- **Frontend**: Vanilla JavaScript
- **Backend**: FastAPI (Python)
This migration aims to address limitations of the Gradio framework, such as audio playback issues, limited UI control, and state management complexity, and to provide a more robust, performant, and professional user experience.
## 3. High-Level Plan & Existing Documentation
A comprehensive implementation plan for this migration already exists and should be consulted. This plan (Memory ID c20c2cce-46d4-453f-9bc3-c18e05dbc66f) outlines:
- A 4-phase implementation (Backend API, Frontend Development, Integration & Testing, Production Features).
- The complete technical architecture.
- A detailed component system (DialogLine, AudioPlayer, ProjectManager).
- Features like real-time status updates and drag-and-drop functionality.
- Migration strategies.
- Expected benefits (e.g., faster responsiveness, better audio reliability).
- An estimated timeline.
## 4. Scope of Current Work
The immediate next step, as requested by the user, is to:
1. Review the existing `gradio_app.py`.
2. Refine or detail the plan for re-implementing the dialog generation functionality with the new stack, leveraging the existing comprehensive plan.
This document will be updated as the project progresses to reflect new decisions, architectural changes, and milestones.

.note/session_log.md Normal file

@@ -0,0 +1,46 @@
# Session Log
---
**Session Start:** 2025-06-05 (Continued)
**Goal:** Progress Phase 1 of Chatterbox TTS backend migration: Initial Project Setup.
**Key Activities & Insights:**
- Created `backend/app/main.py` with a basic FastAPI application instance.
- Confirmed user has an existing `.venv` at the project root.
- Updated `backend/README.md` to reflect usage of the root `.venv` instead of a backend-specific one.
- Adjusted venv activation paths and command execution locations (project root).
- Installed backend dependencies from `backend/requirements.txt` into the root `.venv`.
- Successfully ran the basic FastAPI server using `uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000` from the project root.
- Verified the API is accessible.
- Confirmed all Memory Bank files are present. Reviewed `current_focus.md` and `session_log.md`.
**Next Steps:**
- Update `current_focus.md` and `session_log.md`.
- Proceed to Phase 1, Step 2: Speaker Management.
---
---
**Session Start:** 2025-06-05
**Goal:** Initiate migration of Chatterbox TTS dialog generator from Gradio to Vanilla JS + FastAPI.
**Key Activities & Insights:**
- User requested review of `gradio_app.py` and a plan for re-implementation.
- Checked for `.note/` Memory Bank directory (MEMORY[user_global]).
- Directory not found.
- Read `README.md` to gather project context.
- Created `.note/` directory and populated standard files:
- `project_overview.md` (with initial content based on README and user request).
- `current_focus.md` (outlining immediate tasks).
- `development_standards.md` (template).
- `decision_log.md` (logged decision to use Memory Bank).
- `code_structure.md` (initial thoughts on current and future structure).
- `session_log.md` (this entry).
- `interfaces.md` (template).
**Next Steps:**
- Confirm Memory Bank setup with the user.
- Proceed to review `gradio_app.py`.
---

babel.config.cjs Normal file

@@ -0,0 +1,13 @@
// babel.config.cjs
module.exports = {
presets: [
[
'@babel/preset-env',
{
targets: {
node: 'current', // Target the current version of Node.js
},
},
],
],
};

backend/README.md Normal file

@@ -0,0 +1,34 @@
# Chatterbox TTS Backend
This directory contains the FastAPI backend for the Chatterbox TTS application.
## Project Structure
- `app/`: Contains the main FastAPI application code.
- `__init__.py`: Makes `app` a Python package.
- `main.py`: FastAPI application instance and core API endpoints.
- `services/`: Business logic for TTS, dialog processing, etc.
- `models/`: Pydantic models for API request/response.
- `utils/`: Utility functions.
- `requirements.txt`: Project dependencies for the backend.
- `README.md`: This file.
## Setup & Running
It is assumed you have a Python virtual environment at the project root (e.g., `.venv`).
1. Navigate to the **project root** directory (e.g., `/Volumes/SAM2/CODE/chatterbox-test`).
2. Activate the existing Python virtual environment:
```bash
source .venv/bin/activate # On macOS/Linux
# .\.venv\Scripts\activate # On Windows
```
3. Install dependencies (ensure your terminal is in the **project root**):
```bash
pip install -r backend/requirements.txt
```
4. Run the development server (ensure your terminal is in the **project root**):
```bash
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```
The API should then be accessible at `http://127.0.0.1:8000`.

backend/app/__init__.py Normal file

@@ -0,0 +1 @@

backend/app/config.py Normal file

@@ -0,0 +1,19 @@
from pathlib import Path
# Determine PROJECT_ROOT dynamically.
# If config.py is at /Volumes/SAM2/CODE/chatterbox-test/backend/app/config.py
# then PROJECT_ROOT (/Volumes/SAM2/CODE/chatterbox-test) is 2 levels up.
PROJECT_ROOT = Path(__file__).resolve().parents[2]
# Speaker data paths
SPEAKER_DATA_BASE_DIR = PROJECT_ROOT / "speaker_data"
SPEAKER_SAMPLES_DIR = SPEAKER_DATA_BASE_DIR / "speaker_samples"
SPEAKERS_YAML_FILE = SPEAKER_DATA_BASE_DIR / "speakers.yaml"
# TTS temporary output path (used by DialogProcessorService)
TTS_TEMP_OUTPUT_DIR = PROJECT_ROOT / "tts_temp_outputs"
# Final dialog output path (used by Dialog router and served by main app)
# These are stored within the 'backend' directory to be easily servable.
DIALOG_OUTPUT_PARENT_DIR = PROJECT_ROOT / "backend"
DIALOG_GENERATED_DIR = DIALOG_OUTPUT_PARENT_DIR / "tts_generated_dialogs"

backend/app/main.py Normal file

@@ -0,0 +1,43 @@
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from pathlib import Path
from app.routers import speakers, dialog # Import the routers
from app import config
app = FastAPI(
title="Chatterbox TTS API",
description="API for generating TTS dialogs using Chatterbox TTS.",
version="0.1.0",
)
# CORS Middleware configuration
origins = [
"http://localhost:8001",
"http://127.0.0.1:8001",
# Add other origins if needed, e.g., your deployed frontend URL
]
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"], # Allows all methods
allow_headers=["*"], # Allows all headers
)
# Include routers
app.include_router(speakers.router, prefix="/api/speakers", tags=["Speakers"])
app.include_router(dialog.router, prefix="/api/dialog", tags=["Dialog Generation"])
@app.get("/")
async def read_root():
return {"message": "Welcome to the Chatterbox TTS API!"}
# Ensure the directory for serving generated audio exists
config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True)
# Mount StaticFiles to serve generated dialogs
app.mount("/generated_audio", StaticFiles(directory=config.DIALOG_GENERATED_DIR), name="generated_audio")
# Further endpoints for speakers, dialog generation, etc., will be added here.

backend/app/models/__init__.py Normal file

@@ -0,0 +1 @@

backend/app/models/dialog_models.py Normal file

@@ -0,0 +1,43 @@
from pydantic import BaseModel, Field, validator
from typing import List, Union, Literal, Optional
class DialogItemBase(BaseModel):
type: str
class SpeechItem(DialogItemBase):
type: Literal['speech'] = 'speech'
speaker_id: str = Field(..., description="ID of the speaker for this speech segment.")
text: str = Field(..., description="Text content to be synthesized.")
exaggeration: Optional[float] = Field(0.5, description="Controls the expressiveness of the speech. Higher values lead to more exaggerated speech. Default from Gradio.")
cfg_weight: Optional[float] = Field(0.5, description="Classifier-Free Guidance weight. Higher values make the speech more aligned with the prompt text and speaker characteristics. Default from Gradio.")
temperature: Optional[float] = Field(0.8, description="Controls randomness in generation. Lower values make speech more deterministic, higher values more varied. Default from Gradio.")
class SilenceItem(DialogItemBase):
type: Literal['silence'] = 'silence'
duration: float = Field(..., gt=0, description="Duration of the silence in seconds.")
class DialogRequest(BaseModel):
dialog_items: List[Union[SpeechItem, SilenceItem]] = Field(..., description="A list of speech and silence items.")
output_base_name: str = Field(..., description="Base name for the output files (e.g., 'my_dialog_v1'). Extensions will be added automatically.")
@validator('dialog_items', pre=True, each_item=True)
def check_item_type(cls, item):
if not isinstance(item, dict):
raise ValueError("Each dialog item must be a dictionary.")
item_type = item.get('type')
if item_type == 'speech':
# Pydantic will handle further validation based on SpeechItem model
return item
elif item_type == 'silence':
# Pydantic will handle further validation based on SilenceItem model
return item
raise ValueError(f"Unknown dialog item type: {item_type}. Must be 'speech' or 'silence'.")
class DialogResponse(BaseModel):
log: str = Field(description="Log of the dialog generation process.")
# For now, these URLs might be relative paths or placeholders.
# Actual serving strategy will determine the final URL format.
concatenated_audio_url: Optional[str] = Field(None, description="URL/path to the concatenated audio file.")
zip_archive_url: Optional[str] = Field(None, description="URL/path to the ZIP archive of all audio files.")
temp_dir_path: Optional[str] = Field(None, description="Path to the temporary directory holding generated files, for server-side reference.")
error_message: Optional[str] = Field(None, description="Error message if the process failed globally.")

backend/app/models/speaker_models.py Normal file

@@ -0,0 +1,20 @@
from pydantic import BaseModel
from typing import Optional
class SpeakerBase(BaseModel):
name: str
class SpeakerCreate(SpeakerBase):
# For receiving speaker name, file will be handled separately by FastAPI's UploadFile
pass
class Speaker(SpeakerBase):
id: str
sample_path: Optional[str] = None # Path to the speaker's audio sample
class Config:
from_attributes = True # Replaces orm_mode = True in Pydantic v2
class SpeakerResponse(SpeakerBase):
id: str
message: Optional[str] = None

backend/app/routers/__init__.py Normal file

@@ -0,0 +1 @@

backend/app/routers/dialog.py Normal file

@@ -0,0 +1,189 @@
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
from pathlib import Path
import shutil
from app.models.dialog_models import DialogRequest, DialogResponse
from app.services.tts_service import TTSService
from app.services.speaker_service import SpeakerManagementService
from app.services.dialog_processor_service import DialogProcessorService
from app.services.audio_manipulation_service import AudioManipulationService
from app import config
router = APIRouter()
# --- Dependency Injection for Services ---
# These can be more sophisticated with a proper DI container or FastAPI's Depends system if services had complex init.
# For now, direct instantiation or simple Depends is fine.
def get_tts_service():
# Consider making device configurable
return TTSService(device="mps")
def get_speaker_management_service():
return SpeakerManagementService()
def get_dialog_processor_service(
tts_service: TTSService = Depends(get_tts_service),
speaker_service: SpeakerManagementService = Depends(get_speaker_management_service)
):
return DialogProcessorService(tts_service=tts_service, speaker_service=speaker_service)
def get_audio_manipulation_service():
return AudioManipulationService()
# --- Helper function to manage TTS model loading/unloading ---
async def manage_tts_model_lifecycle(tts_service: TTSService, task_function, *args, **kwargs):
"""Loads TTS model, executes task, then unloads model."""
try:
print("API: Loading TTS model...")
tts_service.load_model()
return await task_function(*args, **kwargs)
except Exception as e:
# Log or handle specific exceptions if needed before re-raising
print(f"API: Error during TTS model lifecycle or task execution: {e}")
raise
finally:
print("API: Unloading TTS model...")
tts_service.unload_model()
async def process_dialog_flow(
request: DialogRequest,
dialog_processor: DialogProcessorService,
audio_manipulator: AudioManipulationService,
background_tasks: BackgroundTasks
) -> DialogResponse:
"""Core logic for processing the dialog request."""
processing_log_entries = []
concatenated_audio_file_path = None
    zip_archive_path = None
final_temp_dir_path_str = None
try:
# 1. Process dialog to generate segments
# The DialogProcessorService creates its own temp dir for segments
dialog_processing_result = await dialog_processor.process_dialog(
dialog_items=[item.model_dump() for item in request.dialog_items],
output_base_name=request.output_base_name
)
processing_log_entries.append(dialog_processing_result['log'])
segment_details = dialog_processing_result['segment_files']
temp_segment_dir = Path(dialog_processing_result['temp_dir'])
final_temp_dir_path_str = str(temp_segment_dir)
# Filter out error segments for concatenation and zipping
valid_segment_paths_for_concat = [
Path(s['path']) for s in segment_details
if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists()
]
# Create a list of dicts suitable for concatenation service (speech paths and silence durations)
items_for_concatenation = []
for s_detail in segment_details:
if s_detail['type'] == 'speech' and s_detail.get('path') and Path(s_detail['path']).exists():
items_for_concatenation.append({'type': 'speech', 'path': s_detail['path']})
elif s_detail['type'] == 'silence' and 'duration' in s_detail:
items_for_concatenation.append({'type': 'silence', 'duration': s_detail['duration']})
# Errors are already logged by DialogProcessor
if not any(item['type'] == 'speech' for item in items_for_concatenation):
message = "No valid speech segments were generated. Cannot create concatenated audio or ZIP."
processing_log_entries.append(message)
return DialogResponse(
log="\n".join(processing_log_entries),
temp_dir_path=final_temp_dir_path_str,
error_message=message
)
# 2. Concatenate audio segments
config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True)
concat_filename = f"{request.output_base_name}_concatenated.wav"
concatenated_audio_file_path = config.DIALOG_GENERATED_DIR / concat_filename
audio_manipulator.concatenate_audio_segments(
segment_results=items_for_concatenation,
output_concatenated_path=concatenated_audio_file_path
)
processing_log_entries.append(f"Concatenated audio saved to: {concatenated_audio_file_path}")
# 3. Create ZIP archive
zip_filename = f"{request.output_base_name}_dialog_output.zip"
zip_archive_path = config.DIALOG_GENERATED_DIR / zip_filename
# Collect all valid generated speech segment files for zipping
individual_segment_paths = [
Path(s['path']) for s in segment_details
if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists()
]
        # concatenated_audio_file_path was produced by the concatenation step above
audio_manipulator.create_zip_archive(
segment_file_paths=individual_segment_paths,
concatenated_audio_path=concatenated_audio_file_path,
output_zip_path=zip_archive_path
)
processing_log_entries.append(f"ZIP archive created at: {zip_archive_path}")
# Schedule cleanup of the temporary segment directory
# background_tasks.add_task(shutil.rmtree, temp_segment_dir, ignore_errors=True)
# processing_log_entries.append(f"Scheduled cleanup for temporary segment directory: {temp_segment_dir}")
# For now, let's not auto-delete, so user can inspect. Cleanup can be a separate endpoint/job.
processing_log_entries.append(f"Temporary segment directory for inspection: {temp_segment_dir}")
return DialogResponse(
log="\n".join(processing_log_entries),
# URLs should be relative to a static serving path, e.g., /generated_audio/
# For now, just returning the name, assuming they are in DIALOG_OUTPUT_DIR
concatenated_audio_url=f"/generated_audio/{concat_filename}",
zip_archive_url=f"/generated_audio/{zip_filename}",
temp_dir_path=final_temp_dir_path_str
)
except FileNotFoundError as e:
error_msg = f"File not found during dialog generation: {e}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=404, detail=error_msg)
except ValueError as e:
error_msg = f"Invalid value or configuration: {e}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=400, detail=error_msg)
except RuntimeError as e:
error_msg = f"Runtime error during dialog generation: {e}"
processing_log_entries.append(error_msg)
# This could be a 500 if it's an unexpected server error
raise HTTPException(status_code=500, detail=error_msg)
except Exception as e:
import traceback
error_msg = f"An unexpected error occurred: {e}\n{traceback.format_exc()}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=500, detail=error_msg)
finally:
# Ensure logs are captured even if an early exception occurs before full response construction
        if not concatenated_audio_file_path and not zip_archive_path and processing_log_entries:
print("Dialog generation failed. Log: \n" + "\n".join(processing_log_entries))
@router.post("/generate", response_model=DialogResponse)
async def generate_dialog_endpoint(
request: DialogRequest,
background_tasks: BackgroundTasks,
tts_service: TTSService = Depends(get_tts_service),
dialog_processor: DialogProcessorService = Depends(get_dialog_processor_service),
audio_manipulator: AudioManipulationService = Depends(get_audio_manipulation_service)
):
"""
Generates a dialog from a list of speech and silence items.
- Processes text into manageable chunks.
- Generates speech for each chunk using the specified speaker.
- Inserts silences as requested.
- Concatenates all audio segments into a single file.
- Creates a ZIP archive of all individual segments and the concatenated file.
"""
# Wrap the core processing logic with model loading/unloading
return await manage_tts_model_lifecycle(
tts_service,
process_dialog_flow,
request=request,
dialog_processor=dialog_processor,
audio_manipulator=audio_manipulator,
background_tasks=background_tasks
)
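
A hedged usage sketch for this endpoint, using the `requests` library (not among the listed backend dependencies, so it is an assumption that it is installed); the speaker ID and base URL are placeholders:

```python
import requests  # assumed to be installed; illustration only

payload = {
    "output_base_name": "demo_dialog",
    "dialog_items": [
        {"type": "speech", "speaker_id": "YOUR_SPEAKER_ID", "text": "Hello there!"},
        {"type": "silence", "duration": 0.5},
        {"type": "speech", "speaker_id": "YOUR_SPEAKER_ID",
         "text": "This is the second line.", "temperature": 0.7},
    ],
}

# Generation is synchronous in this endpoint, so long dialogs can take a while.
resp = requests.post("http://127.0.0.1:8000/api/dialog/generate", json=payload, timeout=600)
resp.raise_for_status()
result = resp.json()
print(result["log"])

if result.get("error_message"):
    print("Generation failed:", result["error_message"])
else:
    # URLs are served by the /generated_audio StaticFiles mount configured in main.py
    print("Concatenated audio:", "http://127.0.0.1:8000" + result["concatenated_audio_url"])
    print("ZIP archive:", "http://127.0.0.1:8000" + result["zip_archive_url"])
```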

backend/app/routers/speakers.py Normal file

@@ -0,0 +1,81 @@
from typing import List, Annotated
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Form
from app.models.speaker_models import Speaker, SpeakerResponse
from app.services.speaker_service import SpeakerManagementService
router = APIRouter(
tags=["Speakers"],
responses={404: {"description": "Not found"}},
)
# Dependency to get the speaker service instance
# This could be more sophisticated with a proper DI system later
def get_speaker_service():
return SpeakerManagementService()
@router.get("/", response_model=List[Speaker])
async def get_all_speakers(
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Retrieve all available speakers.
"""
return service.get_speakers()
@router.post("/", response_model=SpeakerResponse, status_code=201)
async def create_new_speaker(
name: Annotated[str, Form()],
audio_file: Annotated[UploadFile, File()],
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Add a new speaker.
Requires speaker name (form data) and an audio sample file (file upload).
"""
if not audio_file.filename:
raise HTTPException(status_code=400, detail="No audio file provided.")
if not audio_file.content_type or not audio_file.content_type.startswith("audio/"):
raise HTTPException(status_code=400, detail="Invalid audio file type. Please upload a valid audio file (e.g., WAV, MP3).")
try:
new_speaker = await service.add_speaker(name=name, audio_file=audio_file)
return SpeakerResponse(
id=new_speaker.id,
name=new_speaker.name,
message="Speaker added successfully."
)
except HTTPException as e:
# Re-raise HTTPExceptions from the service (e.g., file save error)
raise e
except Exception as e:
# Catch-all for other unexpected errors
raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {str(e)}")
@router.get("/{speaker_id}", response_model=Speaker)
async def get_speaker_details(
speaker_id: str,
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Get details for a specific speaker by ID.
"""
speaker = service.get_speaker_by_id(speaker_id)
if not speaker:
raise HTTPException(status_code=404, detail="Speaker not found")
return speaker
@router.delete("/{speaker_id}", response_model=dict)
async def remove_speaker(
speaker_id: str,
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Delete a speaker by ID.
"""
deleted = service.delete_speaker(speaker_id)
if not deleted:
raise HTTPException(status_code=404, detail="Speaker not found or could not be deleted.")
return {"message": "Speaker deleted successfully"}

backend/app/services/__init__.py Normal file

@@ -0,0 +1 @@

backend/app/services/audio_manipulation_service.py Normal file

@@ -0,0 +1,241 @@
import torch
import torchaudio
from pathlib import Path
from typing import List, Dict, Union, Tuple
import zipfile
# Define a common sample rate, e.g., from the TTS model. This should ideally be configurable or dynamically obtained.
# For now, let's assume the TTS model (ChatterboxTTS) outputs at a known sample rate.
# The ChatterboxTTS model.sr is 24000.
DEFAULT_SAMPLE_RATE = 24000
class AudioManipulationService:
def __init__(self, default_sample_rate: int = DEFAULT_SAMPLE_RATE):
self.sample_rate = default_sample_rate
def _load_audio(self, file_path: Union[str, Path]) -> Tuple[torch.Tensor, int]:
"""Loads an audio file and returns the waveform and sample rate."""
try:
waveform, sr = torchaudio.load(file_path)
return waveform, sr
except Exception as e:
raise RuntimeError(f"Error loading audio file {file_path}: {e}")
def _create_silence(self, duration_seconds: float) -> torch.Tensor:
"""Creates a silent audio tensor of a given duration."""
num_frames = int(duration_seconds * self.sample_rate)
return torch.zeros((1, num_frames)) # Mono silence
def concatenate_audio_segments(
self,
segment_results: List[Dict],
output_concatenated_path: Path
) -> Path:
"""
Concatenates audio segments and silences into a single audio file.
Args:
segment_results: A list of dictionaries, where each dict represents an audio
segment or a silence. Expected format:
For speech: {'type': 'speech', 'path': 'path/to/audio.wav', ...}
For silence: {'type': 'silence', 'duration': 0.5, ...}
output_concatenated_path: The path to save the final concatenated audio file.
Returns:
The path to the concatenated audio file.
"""
all_waveforms: List[torch.Tensor] = []
current_sample_rate = self.sample_rate # Assume this initially, verify with first loaded audio
for i, segment_info in enumerate(segment_results):
segment_type = segment_info.get("type")
if segment_type == "speech":
audio_path_str = segment_info.get("path")
if not audio_path_str:
print(f"Warning: Speech segment {i} has no path. Skipping.")
continue
audio_path = Path(audio_path_str)
if not audio_path.exists():
print(f"Warning: Audio file {audio_path} for segment {i} not found. Skipping.")
continue
try:
waveform, sr = self._load_audio(audio_path)
# Ensure consistent sample rate. Resample if necessary.
# For simplicity, this example assumes all inputs will match self.sample_rate
# or the first loaded audio's sample rate. A more robust implementation
# would resample if sr != current_sample_rate.
if i == 0 and not all_waveforms: # First audio segment sets the reference SR if not default
current_sample_rate = sr
if sr != self.sample_rate:
print(f"Warning: First audio segment SR ({sr} Hz) differs from service default SR ({self.sample_rate} Hz). Using segment SR.")
if sr != current_sample_rate:
print(f"Warning: Sample rate mismatch for {audio_path} ({sr} Hz) vs expected ({current_sample_rate} Hz). Resampling...")
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=current_sample_rate)
waveform = resampler(waveform)
# Ensure mono. If stereo, take the mean or first channel.
if waveform.shape[0] > 1:
waveform = torch.mean(waveform, dim=0, keepdim=True)
all_waveforms.append(waveform)
except Exception as e:
print(f"Error processing speech segment {audio_path}: {e}. Skipping.")
elif segment_type == "silence":
duration = segment_info.get("duration")
if duration is None or not isinstance(duration, (int, float)) or duration < 0:
print(f"Warning: Silence segment {i} has invalid duration. Skipping.")
continue
silence_waveform = self._create_silence(float(duration))
all_waveforms.append(silence_waveform)
elif segment_type == "error":
# Errors are already logged by DialogProcessorService, just skip here.
print(f"Skipping segment {i} due to previous error: {segment_info.get('message')}")
continue
else:
print(f"Warning: Unknown segment type '{segment_type}' at index {i}. Skipping.")
if not all_waveforms:
raise ValueError("No valid audio segments or silences found to concatenate.")
# Concatenate all waveforms
final_waveform = torch.cat(all_waveforms, dim=1)
# Ensure output directory exists
output_concatenated_path.parent.mkdir(parents=True, exist_ok=True)
# Save the concatenated audio
try:
torchaudio.save(str(output_concatenated_path), final_waveform, current_sample_rate)
print(f"Concatenated audio saved to: {output_concatenated_path}")
return output_concatenated_path
except Exception as e:
raise RuntimeError(f"Error saving concatenated audio to {output_concatenated_path}: {e}")
def create_zip_archive(
self,
segment_file_paths: List[Path],
concatenated_audio_path: Path,
output_zip_path: Path
) -> Path:
"""
Creates a ZIP archive containing individual audio segments and the concatenated audio file.
Args:
segment_file_paths: A list of paths to the individual audio segment files.
concatenated_audio_path: Path to the final concatenated audio file.
output_zip_path: The path to save the output ZIP archive.
Returns:
The path to the created ZIP archive.
"""
output_zip_path.parent.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
# Add concatenated audio
if concatenated_audio_path.exists():
zf.write(concatenated_audio_path, arcname=concatenated_audio_path.name)
else:
print(f"Warning: Concatenated audio file {concatenated_audio_path} not found for zipping.")
# Add individual segments
segments_dir_name = "segments"
for file_path in segment_file_paths:
if file_path.exists() and file_path.is_file():
# Store segments in a subdirectory within the zip for organization
zf.write(file_path, arcname=Path(segments_dir_name) / file_path.name)
else:
print(f"Warning: Segment file {file_path} not found or is not a file. Skipping for zipping.")
print(f"ZIP archive created at: {output_zip_path}")
return output_zip_path
# Example Usage (Test Block)
if __name__ == "__main__":
import tempfile
import shutil
# Create a temporary directory for test files
test_temp_dir = Path(tempfile.mkdtemp(prefix="audio_manip_test_"))
print(f"Created temporary test directory: {test_temp_dir}")
# Instance of the service
audio_service = AudioManipulationService()
# --- Test Data Setup ---
# Create dummy audio files (e.g., short silences with different names)
dummy_sr = audio_service.sample_rate
segment1_path = test_temp_dir / "segment1_speech.wav"
segment2_path = test_temp_dir / "segment2_speech.wav"
torchaudio.save(str(segment1_path), audio_service._create_silence(1.0), dummy_sr)
# Create a dummy segment with a different sample rate to test resampling
dummy_sr_alt = 16000
temp_waveform_alt_sr = torch.rand((1, int(0.5 * dummy_sr_alt))) # 0.5s at 16kHz
torchaudio.save(str(segment2_path), temp_waveform_alt_sr, dummy_sr_alt)
segment_results_for_concat = [
{"type": "speech", "path": str(segment1_path), "speaker_id": "spk1", "text_chunk": "Test 1"},
{"type": "silence", "duration": 0.5},
{"type": "speech", "path": str(segment2_path), "speaker_id": "spk2", "text_chunk": "Test 2 (alt SR)"},
{"type": "error", "message": "Simulated error, should be skipped"},
{"type": "speech", "path": "non_existent_segment.wav"}, # Test non-existent file
{"type": "silence", "duration": -0.2} # Test invalid duration
]
concatenated_output_path = test_temp_dir / "final_concatenated_audio.wav"
zip_output_path = test_temp_dir / "audio_archive.zip"
all_segment_files_for_zip = [segment1_path, segment2_path]
try:
# Test concatenation
print("\n--- Testing Concatenation ---")
actual_concat_path = audio_service.concatenate_audio_segments(
segment_results_for_concat,
concatenated_output_path
)
print(f"Concatenation test successful. Output: {actual_concat_path}")
assert actual_concat_path.exists()
# Basic check: load concatenated and verify duration (approx)
concat_wav, concat_sr = audio_service._load_audio(actual_concat_path)
expected_duration = 1.0 + 0.5 + 0.5 # seg1 (1.0s) + silence (0.5s) + seg2 (0.5s) = 2.0s
actual_duration = concat_wav.shape[1] / concat_sr
print(f"Expected duration (approx): {expected_duration}s, Actual duration: {actual_duration:.2f}s")
assert abs(actual_duration - expected_duration) < 0.1 # Allow small deviation
# Test Zipping
print("\n--- Testing Zipping ---")
actual_zip_path = audio_service.create_zip_archive(
all_segment_files_for_zip,
actual_concat_path,
zip_output_path
)
print(f"Zipping test successful. Output: {actual_zip_path}")
assert actual_zip_path.exists()
# Verify zip contents (basic check)
segments_dir_name = "segments" # Define this for the assertion below
with zipfile.ZipFile(actual_zip_path, 'r') as zf_read:
zip_contents = zf_read.namelist()
print(f"ZIP contents: {zip_contents}")
assert Path(segments_dir_name) / segment1_path.name in [Path(p) for p in zip_contents]
assert Path(segments_dir_name) / segment2_path.name in [Path(p) for p in zip_contents]
assert concatenated_output_path.name in zip_contents
print("\nAll AudioManipulationService tests passed!")
except Exception as e:
import traceback
print(f"\nAn error occurred during AudioManipulationService tests:")
traceback.print_exc()
finally:
# Clean up temporary directory
# shutil.rmtree(test_temp_dir)
# print(f"Cleaned up temporary test directory: {test_temp_dir}")
print(f"Test files are in {test_temp_dir}. Please inspect and delete manually if needed.")

backend/app/services/dialog_processor_service.py Normal file

@@ -0,0 +1,265 @@
from pathlib import Path
from typing import List, Dict, Any, Union
import re
from .tts_service import TTSService
from .speaker_service import SpeakerManagementService
from app import config
# Potentially models for dialog structure if we define them
# from ..models.dialog_models import DialogItem # Example
class DialogProcessorService:
def __init__(self, tts_service: TTSService, speaker_service: SpeakerManagementService):
self.tts_service = tts_service
self.speaker_service = speaker_service
# Base directory for storing individual audio segments during processing
self.temp_audio_dir = config.TTS_TEMP_OUTPUT_DIR
self.temp_audio_dir.mkdir(parents=True, exist_ok=True)
def _split_text(self, text: str, max_length: int = 300) -> List[str]:
"""
Splits text into chunks suitable for TTS processing, attempting to respect sentence boundaries.
Similar to split_text_at_sentence_boundaries from the original Gradio app.
        `max_length` is approximate, as the splitter tries to finish sentences.
"""
# Basic sentence splitting using common delimiters. More sophisticated NLP could be used.
# This regex tries to split by '.', '!', '?', '...', followed by space or end of string.
# It also handles cases where these delimiters might be followed by quotes or parentheses.
sentences = re.split(r'(?<=[.!?\u2026])\s+|(?<=[.!?\u2026])(?=["\')\]\}\u201d\u2019])|(?<=[.!?\u2026])$', text.strip())
sentences = [s.strip() for s in sentences if s and s.strip()]
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence:
continue
if not current_chunk: # First sentence for this chunk
current_chunk = sentence
elif len(current_chunk) + len(sentence) + 1 <= max_length:
current_chunk += " " + sentence
else:
chunks.append(current_chunk)
current_chunk = sentence
if current_chunk: # Add the last chunk
chunks.append(current_chunk)
# Further split any chunks that are still too long (e.g., a single very long sentence)
final_chunks = []
for chunk in chunks:
if len(chunk) > max_length:
# Simple split by length if a sentence itself is too long
for i in range(0, len(chunk), max_length):
final_chunks.append(chunk[i:i+max_length])
else:
final_chunks.append(chunk)
return final_chunks
async def process_dialog(self, dialog_items: List[Dict[str, Any]], output_base_name: str) -> Dict[str, Any]:
"""
Processes a list of dialog items (speech or silence) to generate audio segments.
Args:
dialog_items: A list of dictionaries, where each item has:
- 'type': 'speech' or 'silence'
- For 'speech': 'speaker_id': str, 'text': str
- For 'silence': 'duration': float (in seconds)
output_base_name: The base name for the output files.
Returns:
A dictionary containing paths to generated segments and other processing info.
Example: {
"log": "Processing complete...",
"segment_files": [
{"type": "speech", "path": "/path/to/segment1.wav", "speaker_id": "X", "text_chunk": "..."},
{"type": "silence", "duration": 0.5},
{"type": "speech", "path": "/path/to/segment2.wav", "speaker_id": "Y", "text_chunk": "..."}
],
"temp_dir": str(self.temp_audio_dir / output_base_name)
}
"""
segment_results = []
processing_log = []
# Create a unique subdirectory for this dialog's temporary files
dialog_temp_dir = self.temp_audio_dir / output_base_name
dialog_temp_dir.mkdir(parents=True, exist_ok=True)
processing_log.append(f"Created temporary directory for segments: {dialog_temp_dir}")
segment_idx = 0
for i, item in enumerate(dialog_items):
item_type = item.get("type")
processing_log.append(f"Processing item {i+1}: type='{item_type}'")
if item_type == "speech":
speaker_id = item.get("speaker_id")
text = item.get("text")
if not speaker_id or not text:
processing_log.append(f"Skipping speech item {i+1} due to missing speaker_id or text.")
segment_results.append({"type": "error", "message": "Missing speaker_id or text"})
continue
# Validate speaker_id and get speaker_sample_path
speaker_info = self.speaker_service.get_speaker_by_id(speaker_id)
if not speaker_info:
processing_log.append(f"Speaker ID '{speaker_id}' not found. Skipping item {i+1}.")
segment_results.append({"type": "error", "message": f"Speaker ID '{speaker_id}' not found"})
continue
if not speaker_info.sample_path:
processing_log.append(f"Speaker ID '{speaker_id}' has no sample path defined. Skipping item {i+1}.")
segment_results.append({"type": "error", "message": f"Speaker ID '{speaker_id}' has no sample path defined"})
continue
# speaker_info.sample_path is relative to config.SPEAKER_DATA_BASE_DIR
abs_speaker_sample_path = config.SPEAKER_DATA_BASE_DIR / speaker_info.sample_path
if not abs_speaker_sample_path.is_file():
processing_log.append(f"Speaker sample file not found or is not a file at '{abs_speaker_sample_path}' for speaker ID '{speaker_id}'. Skipping item {i+1}.")
segment_results.append({"type": "error", "message": f"Speaker sample not a file or not found: {abs_speaker_sample_path}"})
continue
text_chunks = self._split_text(text)
processing_log.append(f"Split text for speaker '{speaker_id}' into {len(text_chunks)} chunk(s).")
for chunk_idx, text_chunk in enumerate(text_chunks):
segment_filename_base = f"{output_base_name}_seg{segment_idx}_spk{speaker_id}_chunk{chunk_idx}"
processing_log.append(f"Generating speech for chunk: '{text_chunk[:50]}...' using speaker '{speaker_id}'")
try:
segment_output_path = await self.tts_service.generate_speech(
text=text_chunk,
speaker_id=speaker_id, # For metadata, actual sample path is used by TTS
speaker_sample_path=str(abs_speaker_sample_path),
output_filename_base=segment_filename_base,
output_dir=dialog_temp_dir, # Save to the dialog's temp dir
exaggeration=item.get('exaggeration', 0.5), # Default from Gradio, Pydantic model should provide this
cfg_weight=item.get('cfg_weight', 0.5), # Default from Gradio, Pydantic model should provide this
temperature=item.get('temperature', 0.8) # Default from Gradio, Pydantic model should provide this
)
segment_results.append({
"type": "speech",
"path": str(segment_output_path),
"speaker_id": speaker_id,
"text_chunk": text_chunk
})
processing_log.append(f"Successfully generated segment: {segment_output_path}")
except Exception as e:
error_message = f"Error generating speech for chunk '{text_chunk[:50]}...': {repr(e)}"
processing_log.append(error_message)
segment_results.append({"type": "error", "message": error_message, "text_chunk": text_chunk})
segment_idx += 1
elif item_type == "silence":
duration = item.get("duration")
if duration is None or duration < 0:
processing_log.append(f"Skipping silence item {i+1} due to invalid duration.")
segment_results.append({"type": "error", "message": "Invalid duration for silence"})
continue
segment_results.append({"type": "silence", "duration": float(duration)})
processing_log.append(f"Added silence of {duration}s.")
else:
processing_log.append(f"Unknown item type '{item_type}' at item {i+1}. Skipping.")
segment_results.append({"type": "error", "message": f"Unknown item type: {item_type}"})
return {
"log": "\n".join(processing_log),
"segment_files": segment_results,
"temp_dir": str(dialog_temp_dir) # For cleanup or zipping later
}
if __name__ == "__main__":
import asyncio
import pprint
async def main_test():
# Initialize services
tts_service = TTSService(device="mps") # or your preferred device
speaker_service = SpeakerManagementService()
dialog_processor = DialogProcessorService(tts_service, speaker_service)
# Ensure dummy speaker sample exists (TTSService test block usually creates this)
# For robustness, we can call the TTSService test logic or ensure it's run prior.
# Here, we assume dummy_speaker_test.wav is available as per previous steps.
# If not, the 'test_speaker_for_dialog_proc' will fail file validation.
# First, ensure the dummy speaker file is created by TTSService's own test logic
# This is a bit of a hack for testing; ideally, test assets are managed independently.
try:
print("Ensuring dummy speaker sample is created by running TTSService's main_test logic...")
from .tts_service import main_test as tts_main_test
await tts_main_test() # This will create the dummy_speaker_test.wav
print("TTSService main_test completed, dummy sample should exist.")
except ImportError:
print("Could not import tts_service.main_test directly. Ensure dummy_speaker_test.wav exists.")
except Exception as e:
print(f"Error running tts_service.main_test for dummy sample creation: {e}")
print("Proceeding, but 'test_speaker_for_dialog_proc' might fail if sample is missing.")
sample_dialog_items = [
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc", # Defined in speakers.yaml
"text": "Hello world! This is the first speech segment."
},
{
"type": "silence",
"duration": 0.75
},
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc",
"text": "This is a much longer piece of text that should definitely be split into multiple, smaller chunks by the dialog processor. It contains several sentences. Let's see how it handles this. The maximum length is set to 300 characters, but it tries to respect sentence boundaries. This sentence itself is quite long and might even be split mid-sentence if it exceeds the hard limit after sentence splitting. We will observe the output carefully to ensure it works as expected, creating multiple audio files for this single text block if necessary."
},
{
"type": "speech",
"speaker_id": "non_existent_speaker_id",
"text": "This should fail because the speaker does not exist."
},
{
"type": "invalid_type",
"text": "This item has an invalid type."
},
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc",
"text": None # Test missing text
},
{
"type": "speech",
"speaker_id": None, # Test missing speaker_id
"text": "This is a test with a missing speaker ID."
},
{
"type": "silence",
"duration": -0.5 # Invalid duration
}
]
output_base_name = "dialog_processor_test_run"
try:
print(f"\nLoading TTS model for DialogProcessorService test...")
# TTSService's generate_speech will load the model if not already loaded.
# However, explicit load/unload is good practice for a test block.
tts_service.load_model()
print(f"\nProcessing dialog items with base name: {output_base_name}...")
results = await dialog_processor.process_dialog(sample_dialog_items, output_base_name)
print("\n--- Processing Log ---")
print(results.get("log"))
print("\n--- Segment Files / Results ---")
pprint.pprint(results.get("segment_files"))
print(f"\nTemporary directory used: {results.get('temp_dir')}")
print("\nPlease check the temporary directory for generated audio segments.")
except Exception as e:
import traceback
print(f"\nAn error occurred during the DialogProcessorService test:")
traceback.print_exc()
finally:
print("\nUnloading TTS model...")
tts_service.unload_model()
print("DialogProcessorService test finished.")
asyncio.run(main_test())

backend/app/services/speaker_service.py Normal file

@@ -0,0 +1,147 @@
import yaml
import uuid
import os
import io # Added for BytesIO
import torchaudio # Added for audio processing
from pathlib import Path
from typing import List, Dict, Optional, Any
from fastapi import UploadFile, HTTPException
from app.models.speaker_models import Speaker, SpeakerCreate
from app import config
class SpeakerManagementService:
def __init__(self):
self._ensure_data_files_exist()
self.speakers_data = self._load_speakers_data()
def _ensure_data_files_exist(self):
"""Ensures the speaker data directory and YAML file exist."""
config.SPEAKER_DATA_BASE_DIR.mkdir(parents=True, exist_ok=True)
config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
if not config.SPEAKERS_YAML_FILE.exists():
with open(config.SPEAKERS_YAML_FILE, 'w') as f:
yaml.dump({}, f) # Initialize with an empty dict, as per previous fixes
def _load_speakers_data(self) -> Dict[str, Any]: # Changed return type to Dict
"""Loads speaker data from the YAML file."""
try:
with open(config.SPEAKERS_YAML_FILE, 'r') as f:
data = yaml.safe_load(f)
return data if isinstance(data, dict) else {} # Ensure it's a dict
except FileNotFoundError:
return {}
except yaml.YAMLError:
            # Handle a corrupted YAML file: log the error and return an empty dict
print(f"Error: Corrupted speakers YAML file at {config.SPEAKERS_YAML_FILE}")
return {}
def _save_speakers_data(self):
"""Saves the current speaker data to the YAML file."""
with open(config.SPEAKERS_YAML_FILE, 'w') as f:
yaml.dump(self.speakers_data, f, sort_keys=False)
def get_speakers(self) -> List[Speaker]:
"""Returns a list of all speakers."""
# self.speakers_data is now a dict: {speaker_id: {name: ..., sample_path: ...}}
return [Speaker(id=spk_id, **spk_attrs) for spk_id, spk_attrs in self.speakers_data.items()]
def get_speaker_by_id(self, speaker_id: str) -> Optional[Speaker]:
"""Retrieves a speaker by their ID."""
if speaker_id in self.speakers_data:
speaker_attributes = self.speakers_data[speaker_id]
return Speaker(id=speaker_id, **speaker_attributes)
return None
async def add_speaker(self, name: str, audio_file: UploadFile) -> Speaker:
"""Adds a new speaker, converts sample to WAV, saves it, and updates YAML."""
speaker_id = str(uuid.uuid4())
# Define standardized sample filename and path (always WAV)
sample_filename = f"{speaker_id}.wav"
sample_path = config.SPEAKER_SAMPLES_DIR / sample_filename
try:
content = await audio_file.read()
# Use BytesIO to handle the in-memory audio data for torchaudio
audio_buffer = io.BytesIO(content)
# Load audio data using torchaudio, this handles various formats (MP3, WAV, etc.)
# waveform is a tensor, sample_rate is an int
waveform, sample_rate = torchaudio.load(audio_buffer)
# Save the audio data as WAV
# Ensure the SPEAKER_SAMPLES_DIR exists (though _ensure_data_files_exist should handle it)
config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
torchaudio.save(str(sample_path), waveform, sample_rate, format="wav")
except (RuntimeError, OSError) as e:
# torchaudio surfaces decode problems (unsupported format, corrupted file) as RuntimeError/OSError; it does not expose a TorchaudioException class
raise HTTPException(status_code=400, detail=f"Error processing audio file: {e}. Ensure it's a valid audio format (e.g., WAV, MP3).")
except Exception as e:
# General error handling for other issues (e.g., file system errors)
raise HTTPException(status_code=500, detail=f"Could not save audio file: {e}")
finally:
await audio_file.close()
# self.speakers_data is now a dict
self.speakers_data[speaker_id] = {
"name": name,
"sample_path": str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR))
}
self._save_speakers_data()
# Construct Speaker model for return, including the ID
return Speaker(id=speaker_id, name=name, sample_path=str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR)))
def delete_speaker(self, speaker_id: str) -> bool:
"""Deletes a speaker and their audio sample."""
# Speaker data is now a dictionary, keyed by speaker_id
speaker_to_delete = self.speakers_data.pop(speaker_id, None)
if speaker_to_delete:
self._save_speakers_data()
sample_path_str = speaker_to_delete.get("sample_path")
if sample_path_str:
# sample_path_str is relative to SPEAKER_DATA_BASE_DIR
full_sample_path = config.SPEAKER_DATA_BASE_DIR / sample_path_str
try:
if full_sample_path.is_file(): # Check if it's a file before removing
os.remove(full_sample_path)
except OSError as e:
# Log error if file deletion fails but proceed
print(f"Error deleting sample file {full_sample_path}: {e}")
return True
return False
# Example usage (for testing, not part of the service itself)
if __name__ == "__main__":
service = SpeakerManagementService()
print("Initial speakers:", service.get_speakers())
# This part would require a mock UploadFile to run directly
# print("\nAdding a new speaker (manual test setup needed for UploadFile)")
# class MockUploadFile:
# def __init__(self, filename, content):
# self.filename = filename
# self._content = content
# async def read(self): return self._content
# async def close(self): pass
# import asyncio
# async def test_add():
# mock_file = MockUploadFile("test.wav", b"dummy audio content")
# new_speaker = await service.add_speaker(name="Test Speaker", audio_file=mock_file)
# print("\nAdded speaker:", new_speaker)
# print("Speakers after add:", service.get_speakers())
# return new_speaker.id
# speaker_id_to_delete = asyncio.run(test_add())
# if speaker_id_to_delete:
# print(f"\nDeleting speaker {speaker_id_to_delete}")
# service.delete_speaker(speaker_id_to_delete)
# print("Speakers after delete:", service.get_speakers())

View File

@ -0,0 +1,155 @@
import torch
import torchaudio
from typing import Optional
from chatterbox.tts import ChatterboxTTS
from pathlib import Path
import gc # Garbage collector for memory management
# Define a directory for TTS model outputs, could be temporary or configurable
TTS_OUTPUT_DIR = Path("/Volumes/SAM2/CODE/chatterbox-test/tts_outputs") # Example path
class TTSService:
def __init__(self, device: str = "mps"): # Default to MPS for Macs, can be "cpu" or "cuda"
self.device = device
self.model = None
self._ensure_output_dir_exists()
def _ensure_output_dir_exists(self):
"""Ensures the TTS output directory exists."""
TTS_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
def load_model(self):
"""Loads the ChatterboxTTS model."""
if self.model is None:
print(f"Loading ChatterboxTTS model to device: {self.device}...")
try:
self.model = ChatterboxTTS.from_pretrained(device=self.device)
print("ChatterboxTTS model loaded successfully.")
except Exception as e:
print(f"Error loading ChatterboxTTS model: {e}")
# Potentially raise an exception or handle appropriately
raise
else:
print("ChatterboxTTS model already loaded.")
def unload_model(self):
"""Unloads the model and clears memory."""
if self.model is not None:
print("Unloading ChatterboxTTS model and clearing cache...")
del self.model
self.model = None
if self.device == "cuda":
torch.cuda.empty_cache()
elif self.device == "mps":
if hasattr(torch.mps, "empty_cache"): # Check if empty_cache is available for MPS
torch.mps.empty_cache()
gc.collect() # Explicitly run garbage collection
print("Model unloaded and memory cleared.")
async def generate_speech(
self,
text: str,
speaker_sample_path: str, # Absolute path to the speaker's audio sample
output_filename_base: str, # e.g., "dialog_line_1_spk_X_chunk_0"
speaker_id: Optional[str] = None, # Optional, mainly for logging if needed, filename base is primary
output_dir: Optional[Path] = None, # Optional, defaults to TTS_OUTPUT_DIR from this module
exaggeration: float = 0.5, # Default from Gradio
cfg_weight: float = 0.5, # Default from Gradio
temperature: float = 0.8, # Default from Gradio
) -> Path:
"""
Generates speech from text using the loaded TTS model and a speaker sample.
Saves the output to a .wav file.
"""
if self.model is None:
self.load_model()
if self.model is None: # Check again if loading failed
raise RuntimeError("TTS model is not loaded. Cannot generate speech.")
# Ensure speaker_sample_path is valid
speaker_sample_p = Path(speaker_sample_path)
if not speaker_sample_p.exists() or not speaker_sample_p.is_file():
raise FileNotFoundError(f"Speaker sample audio file not found: {speaker_sample_path}")
target_output_dir = output_dir if output_dir is not None else TTS_OUTPUT_DIR
target_output_dir.mkdir(parents=True, exist_ok=True)
# output_filename_base from DialogProcessorService is expected to be comprehensive (e.g., includes speaker_id, segment info)
output_file_path = target_output_dir / f"{output_filename_base}.wav"
print(f"Generating audio for text: \"{text[:50]}...\" with speaker sample: {speaker_sample_path}")
try:
with torch.no_grad(): # Important for inference
wav = self.model.generate(
text=text,
audio_prompt_path=str(speaker_sample_p), # Must be a string path
exaggeration=exaggeration,
cfg_weight=cfg_weight,
temperature=temperature,
)
torchaudio.save(str(output_file_path), wav, self.model.sr)
print(f"Audio saved to: {output_file_path}")
return output_file_path
except Exception as e:
print(f"Error during TTS generation or saving: {e}")
raise
finally:
# For now, we keep it loaded. Memory management might need refinement.
pass
# Example usage (for testing, not part of the service itself)
if __name__ == "__main__":
async def main_test():
tts_service = TTSService(device="mps")
try:
tts_service.load_model()
dummy_speaker_root = Path("/Volumes/SAM2/CODE/chatterbox-test/speaker_data/speaker_samples")
dummy_speaker_root.mkdir(parents=True, exist_ok=True)
dummy_sample_file = dummy_speaker_root / "dummy_speaker_test.wav"
import os # Added for os.remove
# Always try to remove an existing dummy file to ensure a fresh one is created
if dummy_sample_file.exists():
try:
os.remove(dummy_sample_file)
print(f"Removed existing dummy sample: {dummy_sample_file}")
except OSError as e:
print(f"Error removing existing dummy sample {dummy_sample_file}: {e}")
# Proceeding, but torchaudio.save might fail or overwrite
print(f"Creating new dummy speaker sample: {dummy_sample_file}")
# Create a minimal, silent WAV file for testing
sample_rate = 22050
duration = 1 # seconds
num_channels = 1
num_frames = sample_rate * duration
audio_data = torch.zeros((num_channels, num_frames))
try:
torchaudio.save(str(dummy_sample_file), audio_data, sample_rate)
print(f"Dummy sample created successfully: {dummy_sample_file}")
except Exception as save_e:
print(f"Could not create dummy sample: {save_e}")
# If creation fails, the subsequent generation test will likely also fail or be skipped.
if dummy_sample_file.exists():
output_path = await tts_service.generate_speech(
text="Hello, this is a test of the Text-to-Speech service.",
speaker_id="test_speaker",
speaker_sample_path=str(dummy_sample_file),
output_filename_base="test_generation"
)
print(f"Test generation output: {output_path}")
else:
print(f"Skipping generation test as dummy sample {dummy_sample_file} not found.")
except Exception as e:
import traceback
print(f"Error during TTS generation or saving:")
traceback.print_exc()
finally:
tts_service.unload_model()
import asyncio
asyncio.run(main_test())

7
backend/requirements.txt Normal file
View File

@ -0,0 +1,7 @@
fastapi
uvicorn[standard]
python-multipart
PyYAML
torch
torchaudio
chatterbox-tts

108
backend/run_api_test.py Normal file
View File

@ -0,0 +1,108 @@
import requests
import json
from pathlib import Path
import time
# Configuration
API_BASE_URL = "http://localhost:8000/api/dialog"
ENDPOINT_URL = f"{API_BASE_URL}/generate"
# Define project root relative to this test script (assuming it's in backend/)
PROJECT_ROOT = Path(__file__).resolve().parent
GENERATED_DIALOGS_DIR = PROJECT_ROOT / "tts_generated_dialogs"
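# Note: this script assumes the FastAPI backend is already running and reachable at
# localhost:8000 before it is executed (for example launched with uvicorn from the backend
# directory; the exact module path depends on how main.py exposes the app), and that the
# "dummy_speaker" referenced in the payload exists in speakers.yaml with a valid sample file.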
DIALOG_PAYLOAD = {
"output_base_name": "test_dialog_from_script",
"dialog_items": [
{
"type": "speech",
"speaker_id": "dummy_speaker", # Ensure this speaker exists in your speakers.yaml and has a sample .wav
"text": "This is a test from the Python script. One, two, three.",
"exaggeration": 1.5,
"cfg_weight": 4.0,
"temperature": 0.5
},
{
"type": "silence",
"duration": 0.5
},
{
"type": "speech",
"speaker_id": "dummy_speaker",
"text": "Testing complete. All systems nominal."
},
{
"type": "speech",
"speaker_id": "non_existent_speaker", # Test case for invalid speaker
"text": "This should produce an error for this segment."
},
{
"type": "silence",
"duration": 0.25 # Changed to valid duration
}
]
}
def run_test():
print(f"Sending POST request to: {ENDPOINT_URL}")
print("Payload:")
print(json.dumps(DIALOG_PAYLOAD, indent=2))
print("-" * 50)
try:
start_time = time.time()
response = requests.post(ENDPOINT_URL, json=DIALOG_PAYLOAD, timeout=120) # Increased timeout for TTS processing
end_time = time.time()
print(f"Response received in {end_time - start_time:.2f} seconds.")
print(f"Status Code: {response.status_code}")
print("-" * 50)
if response.content:
try:
response_data = response.json()
print("Response JSON:")
print(json.dumps(response_data, indent=2))
print("-" * 50)
if response.status_code == 200:
print("Test PASSED (HTTP 200 OK)")
concatenated_url = response_data.get("concatenated_audio_url")
zip_url = response_data.get("zip_archive_url")
temp_dir = response_data.get("temp_dir_path")
if concatenated_url:
print(f"Concatenated audio URL: http://localhost:8000{concatenated_url}")
if zip_url:
print(f"ZIP archive URL: http://localhost:8000{zip_url}")
if temp_dir:
print(f"Temporary segment directory: {temp_dir}")
print("\nTo verify, check the generated files in:")
print(f" Concatenated/ZIP: {GENERATED_DIALOGS_DIR}")
print(f" Individual segments (if not cleaned up): {temp_dir}")
else:
print(f"Test FAILED (HTTP {response.status_code})")
if response_data.get("detail"):
print(f"Error Detail: {response_data.get('detail')}")
except json.JSONDecodeError:
print("Response content is not valid JSON:")
print(response.text)
print("Test FAILED (Invalid JSON Response)")
else:
print("Response content is empty.")
print(f"Test FAILED (Empty Response, HTTP {response.status_code})")
except requests.exceptions.ConnectionError as e:
print(f"Connection Error: {e}")
print("Test FAILED (Could not connect to the server. Is it running?)")
except requests.exceptions.Timeout as e:
print(f"Request Timeout: {e}")
print("Test FAILED (The request timed out. TTS processing might be too slow or stuck.)")
except Exception as e:
print(f"An unexpected error occurred: {e}")
print("Test FAILED (Unexpected error)")
if __name__ == "__main__":
run_test()

330
frontend/css/style.css Normal file
View File

@ -0,0 +1,330 @@
/* Modern, clean, and accessible UI styles for Chatterbox TTS */
body {
font-family: 'Segoe UI', 'Roboto', 'Arial', sans-serif;
line-height: 1.7;
margin: 0;
padding: 0;
background-color: #f7f9fa;
color: #222;
}
.container {
max-width: 1100px;
margin: 0 auto;
padding: 0 18px;
}
header {
background: #222e3a;
color: #fff;
padding: 1.5rem 0 1rem 0;
text-align: center;
border-bottom: 3px solid #4a90e2;
}
h1 {
font-size: 2.4rem;
margin: 0;
letter-spacing: 1px;
}
main {
margin-top: 30px;
margin-bottom: 30px;
}
.panel-grid {
display: flex;
flex-wrap: wrap;
gap: 28px;
justify-content: space-between;
}
.panel {
flex: 1 1 320px;
min-width: 320px;
background: none;
box-shadow: none;
border: none;
padding: 0;
}
#results-display.panel {
flex: 1 1 100%;
min-width: 0;
margin-top: 32px;
}
/* Dialog Table Styles */
#dialog-items-table {
width: 100%;
border-collapse: collapse;
background: #fff;
border-radius: 8px;
overflow: hidden;
font-size: 1rem;
margin-bottom: 0;
}
#dialog-items-table th, #dialog-items-table td {
padding: 10px 12px;
border-bottom: 1px solid #e3e3e3;
text-align: left;
}
#dialog-items-table th {
background: #f3f7fa;
color: #4a90e2;
font-weight: 600;
font-size: 1.05rem;
}
#dialog-items-table tr:last-child td {
border-bottom: none;
}
#dialog-items-table td.actions {
text-align: center;
min-width: 90px;
}
/* Collapsible log details */
details#generation-log-details {
margin-bottom: 0;
border-radius: 4px;
background: #f3f5f7;
box-shadow: 0 1px 3px rgba(44,62,80,0.04);
padding: 0 0 0 0;
transition: box-shadow 0.15s;
}
details#generation-log-details[open] {
box-shadow: 0 2px 8px rgba(44,62,80,0.07);
background: #f9fafb;
}
details#generation-log-details summary {
font-size: 1rem;
color: #357ab8;
padding: 10px 0 6px 0;
outline: none;
}
details#generation-log-details summary:focus {
outline: 2px solid #4a90e2;
border-radius: 3px;
}
@media (max-width: 900px) {
.panel-grid {
display: block;
gap: 0;
}
.panel, .full-width-panel {
min-width: 0;
width: 100%;
flex: 1 1 100%;
}
#dialog-items-table th, #dialog-items-table td {
font-size: 0.97rem;
padding: 7px 8px;
}
#speaker-management.panel {
margin-bottom: 36px;
width: 100%;
max-width: 100%;
flex: 1 1 100%;
}
}
.card {
background: #fff;
border-radius: 8px;
box-shadow: 0 2px 8px rgba(44,62,80,0.07);
padding: 18px 20px;
margin-bottom: 18px;
}
section {
margin-bottom: 0;
border-radius: 0;
padding: 0;
background: none;
}
hr {
display: none;
}
h2 {
font-size: 1.5rem;
margin-top: 0;
margin-bottom: 16px;
color: #4a90e2;
letter-spacing: 0.5px;
}
h3 {
font-size: 1.1rem;
margin-bottom: 10px;
color: #333;
}
.x-remove-btn {
background: #e74c3c;
color: #fff;
border: none;
border-radius: 50%;
width: 28px;
height: 28px;
font-size: 1.2rem;
line-height: 1;
display: inline-flex;
align-items: center;
justify-content: center;
cursor: pointer;
transition: background 0.15s;
margin: 0 2px;
box-shadow: 0 1px 2px rgba(44,62,80,0.06);
outline: none;
padding: 0;
}
.x-remove-btn:hover, .x-remove-btn:focus {
background: #c0392b;
color: #fff;
outline: 2px solid #e74c3c;
}
.form-row {
display: flex;
align-items: center;
gap: 12px;
margin-bottom: 14px;
}
label {
min-width: 120px;
font-weight: 500;
margin-bottom: 0;
}
input[type='text'], input[type='file'] {
padding: 8px 10px;
border: 1px solid #cfd8dc;
border-radius: 4px;
font-size: 1rem;
width: 100%;
box-sizing: border-box;
}
input[type='file'] {
background: #f7f7f7;
font-size: 0.97rem;
}
button {
padding: 9px 18px;
background: #4a90e2;
color: #fff;
border: none;
border-radius: 5px;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
transition: background 0.15s;
margin-right: 10px;
}
button:hover, button:focus {
background: #357ab8;
outline: none;
}
.dialog-controls {
margin-bottom: 10px;
}
#speaker-list {
list-style: none;
padding: 0;
margin: 0;
}
#speaker-list li {
padding: 7px 0;
border-bottom: 1px solid #e3e3e3;
display: flex;
justify-content: space-between;
align-items: center;
}
#speaker-list li:last-child {
border-bottom: none;
}
pre {
background: #f3f5f7;
padding: 12px;
border-radius: 4px;
font-size: 0.98rem;
white-space: pre-wrap;
word-wrap: break-word;
margin: 0;
}
audio {
width: 100%;
margin-top: 8px;
margin-bottom: 8px;
}
#zip-archive-link {
display: inline-block;
margin-right: 10px;
color: #fff;
background: #4a90e2;
padding: 7px 16px;
border-radius: 4px;
text-decoration: none;
font-weight: 500;
transition: background 0.15s;
}
#zip-archive-link:hover, #zip-archive-link:focus {
background: #357ab8;
}
footer {
text-align: center;
padding: 20px 0;
background: #222e3a;
color: #fff;
margin-top: 40px;
font-size: 1rem;
border-top: 3px solid #4a90e2;
}
@media (max-width: 900px) {
.panel-grid {
flex-direction: column;
gap: 22px;
}
.panel {
min-width: 0;
}
}
/* Simple side-by-side layout for speaker management */
.speaker-mgmt-row {
display: flex;
gap: 20px;
}
.speaker-mgmt-row .card {
flex: 1;
width: 50%;
}
/* Stack on mobile */
@media (max-width: 768px) {
.speaker-mgmt-row {
flex-direction: column;
}
.speaker-mgmt-row .card {
width: 100%;
}
}

102
frontend/index.html Normal file
View File

@ -0,0 +1,102 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Chatterbox TTS Frontend</title>
<link rel="stylesheet" href="css/style.css">
</head>
<body>
<header>
<div class="container">
<h1>Chatterbox TTS</h1>
</div>
</header>
<main class="container" role="main">
<div class="panel-grid">
<section id="dialog-editor" class="panel full-width-panel" aria-labelledby="dialog-editor-title">
<h2 id="dialog-editor-title">Dialog Editor</h2>
<div class="card">
<table id="dialog-items-table">
<thead>
<tr>
<th>Type</th>
<th>Speaker</th>
<th>Text / Duration</th>
<th>Actions</th>
</tr>
</thead>
<tbody id="dialog-items-container">
<!-- Dialog items will be rendered here by JavaScript as <tr> -->
</tbody>
</table>
</div>
<div id="temp-input-area" class="card">
<!-- Temporary inputs for speech/silence will go here -->
</div>
<div class="dialog-controls form-row">
<button id="add-speech-line-btn">Add Speech Line</button>
<button id="add-silence-line-btn">Add Silence Line</button>
</div>
<div class="dialog-controls form-row">
<label for="output-base-name">Output Base Name:</label>
<input type="text" id="output-base-name" name="output-base-name" value="dialog_output" required>
</div>
<button id="generate-dialog-btn">Generate Dialog</button>
</section>
</div>
<!-- Results below -->
<section id="results-display" class="panel" aria-labelledby="results-display-title">
<h2 id="results-display-title">Results</h2>
<div class="card">
<details id="generation-log-details">
<summary style="cursor:pointer;font-weight:500;">Show Generation Log</summary>
<pre id="generation-log-content" style="margin-top:12px;">(Generation log will appear here)</pre>
</details>
</div>
<div class="card">
<h3>Concatenated Audio:</h3>
<audio id="concatenated-audio-player" controls src=""></audio>
</div>
<div class="card">
<h3>Download Archive:</h3>
<a id="zip-archive-link" href="#" download style="display: none;">Download ZIP</a>
<p id="zip-archive-placeholder">(ZIP download link will appear here)</p>
</div>
</section>
<!-- Speaker management row below Results, side by side -->
<div class="speaker-mgmt-row">
<div id="speaker-list-container" class="card">
<h3>Available Speakers</h3>
<ul id="speaker-list">
<!-- Speakers will be populated here by JavaScript -->
</ul>
</div>
<div id="add-speaker-container" class="card">
<h3>Add New Speaker</h3>
<form id="add-speaker-form">
<div class="form-row">
<label for="speaker-name">Speaker Name:</label>
<input type="text" id="speaker-name" name="name" required>
</div>
<div class="form-row">
<label for="speaker-sample">Audio Sample (WAV or MP3):</label>
<input type="file" id="speaker-sample" name="audio_file" accept=".wav,.mp3" required>
</div>
<button type="submit">Add Speaker</button>
</form>
</div>
</div>
</main>
<footer>
<div class="container">
<p>&copy; 2024 Chatterbox TTS</p>
</div>
</footer>
<script src="js/api.js" type="module"></script>
<script src="js/app.js" type="module" defer></script>
</body>
</html>

131
frontend/js/api.js Normal file
View File

@ -0,0 +1,131 @@
// frontend/js/api.js
const API_BASE_URL = 'http://localhost:8000/api'; // Assuming backend runs on port 8000
/**
* Fetches the list of available speakers.
* @returns {Promise<Array<Object>>} A promise that resolves to an array of speaker objects.
* @throws {Error} If the network response is not ok.
*/
export async function getSpeakers() {
const response = await fetch(`${API_BASE_URL}/speakers/`);
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to fetch speakers: ${errorData.detail || errorData.message || response.statusText}`);
}
return response.json();
}
/**
* Adds a new speaker.
* @param {FormData} formData - The form data containing speaker name and audio file.
* Example: formData.append('name', 'New Speaker');
* formData.append('audio_file', fileInput.files[0]);
* @returns {Promise<Object>} A promise that resolves to the new speaker object.
* @throws {Error} If the network response is not ok.
*/
export async function addSpeaker(formData) {
const response = await fetch(`${API_BASE_URL}/speakers/`, {
method: 'POST',
body: formData, // FormData sets Content-Type to multipart/form-data automatically
});
if (!response.ok) {
console.log('API_JS_ADD_SPEAKER: Entered !response.ok block. Status:', response.status, 'StatusText:', response.statusText);
let errorPayload = { detail: `Request failed with status ${response.status}` }; // Default payload
try {
console.log('API_JS_ADD_SPEAKER: Attempting to parse error response as JSON...');
errorPayload = await response.json();
console.log('API_JS_ADD_SPEAKER: Successfully parsed error JSON:', errorPayload);
} catch (e) {
console.warn('API_JS_ADD_SPEAKER: Failed to parse error response as JSON. Error:', e);
// Use statusText if JSON parsing fails
errorPayload = { detail: response.statusText || `Request failed with status ${response.status} and no JSON body.`, parseError: e.toString() };
}
console.error('--- BEGIN SERVER ERROR PAYLOAD (addSpeaker) ---');
console.error('Status:', response.status);
console.error('Status Text:', response.statusText);
console.error('Parsed Payload:', errorPayload);
console.error('--- END SERVER ERROR PAYLOAD (addSpeaker) ---');
let detailedMessage = "Unknown error";
if (errorPayload && errorPayload.detail) {
if (typeof errorPayload.detail === 'string') {
detailedMessage = errorPayload.detail;
} else {
// If detail is an array (FastAPI validation errors) or object, stringify it.
detailedMessage = JSON.stringify(errorPayload.detail);
}
} else if (errorPayload && errorPayload.message) {
detailedMessage = errorPayload.message;
} else if (response.statusText) {
detailedMessage = response.statusText;
} else {
detailedMessage = `HTTP error ${response.status}`;
}
console.log(`API_JS_ADD_SPEAKER: Constructed detailedMessage: "${detailedMessage}"`);
console.log(`API_JS_ADD_SPEAKER: Throwing error with message: "Failed to add speaker: ${detailedMessage}"`);
throw new Error(`Failed to add speaker: ${detailedMessage}`);
}
return response.json();
}
/**
* Deletes a speaker by their ID.
* @param {string} speakerId - The ID of the speaker to delete.
* @returns {Promise<Object>} A promise that resolves to the response data (e.g., success message).
* @throws {Error} If the network response is not ok.
*/
export async function deleteSpeaker(speakerId) {
const response = await fetch(`${API_BASE_URL}/speakers/${speakerId}/`, {
method: 'DELETE',
});
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to delete speaker ${speakerId}: ${errorData.detail || errorData.message || response.statusText}`);
}
// Handle 204 No Content specifically, as .json() would fail
if (response.status === 204) {
return { message: `Speaker ${speakerId} deleted successfully.` };
}
return response.json();
}
/**
* Generates a dialog by sending a payload to the backend.
* @param {Object} dialogPayload - The payload for dialog generation.
* Example:
* {
* output_base_name: "my_dialog",
* dialog_items: [
* { type: "speech", speaker_id: "speaker1", text: "Hello world.", exaggeration: 1.0, cfg_weight: 2.0, temperature: 0.7 },
* { type: "silence", duration_ms: 500 },
* { type: "speech", speaker_id: "speaker2", text: "How are you?" }
* ]
* }
* @returns {Promise<Object>} A promise that resolves to the dialog generation response (log, file URLs).
* @throws {Error} If the network response is not ok.
*/
export async function generateDialog(dialogPayload) {
const response = await fetch(`${API_BASE_URL}/dialog/generate/`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(dialogPayload),
});
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to generate dialog: ${errorData.detail || errorData.message || response.statusText}`);
}
return response.json();
}

390
frontend/js/app.js Normal file
View File

@ -0,0 +1,390 @@
import { getSpeakers, addSpeaker, deleteSpeaker, generateDialog } from './api.js';
const API_BASE_URL = 'http://localhost:8000'; // Assuming backend runs here
// This should match the base URL from which FastAPI serves static files
// If your main app is at http://localhost:8000, and static files are served from /generated_audio relative to that,
// then this should be http://localhost:8000. The backend will return paths like /generated_audio/...
const API_BASE_URL_FOR_FILES = 'http://localhost:8000';
document.addEventListener('DOMContentLoaded', () => {
console.log('DOM fully loaded and parsed');
initializeSpeakerManagement();
initializeDialogEditor(); // Placeholder for now
initializeResultsDisplay(); // Placeholder for now
});
// --- Speaker Management --- //
const speakerListUL = document.getElementById('speaker-list');
const addSpeakerForm = document.getElementById('add-speaker-form');
function initializeSpeakerManagement() {
loadSpeakers();
if (addSpeakerForm) {
addSpeakerForm.addEventListener('submit', async (event) => {
event.preventDefault();
const formData = new FormData(addSpeakerForm);
const speakerName = formData.get('name');
const audioFile = formData.get('audio_file');
if (!speakerName || !audioFile || audioFile.size === 0) {
alert('Please provide a speaker name and an audio file.');
return;
}
try {
const newSpeaker = await addSpeaker(formData);
alert(`Speaker added: ${newSpeaker.name} (ID: ${newSpeaker.id})`);
addSpeakerForm.reset();
loadSpeakers(); // Refresh speaker list
} catch (error) {
console.error('Failed to add speaker:', error);
alert('Error adding speaker: ' + error.message);
}
});
}
}
async function loadSpeakers() {
if (!speakerListUL) return;
try {
const speakers = await getSpeakers();
speakerListUL.innerHTML = ''; // Clear existing list
if (speakers.length === 0) {
const listItem = document.createElement('li');
listItem.textContent = 'No speakers available.';
speakerListUL.appendChild(listItem);
return;
}
speakers.forEach(speaker => {
const listItem = document.createElement('li');
// Create a container for the speaker name and delete button
const container = document.createElement('div');
container.style.display = 'flex';
container.style.justifyContent = 'space-between';
container.style.alignItems = 'center';
container.style.width = '100%';
// Add speaker name
const nameSpan = document.createElement('span');
nameSpan.textContent = speaker.name;
container.appendChild(nameSpan);
// Add delete button
const deleteBtn = document.createElement('button');
deleteBtn.textContent = 'Delete';
deleteBtn.classList.add('delete-speaker-btn');
deleteBtn.onclick = () => handleDeleteSpeaker(speaker.id);
container.appendChild(deleteBtn);
listItem.appendChild(container);
speakerListUL.appendChild(listItem);
});
} catch (error) {
console.error('Failed to load speakers:', error);
speakerListUL.innerHTML = '<li>Error loading speakers. See console for details.</li>';
alert('Error loading speakers: ' + error.message);
}
}
async function handleDeleteSpeaker(speakerId) {
if (!speakerId) {
alert('Cannot delete speaker: Speaker ID is missing.');
return;
}
if (!confirm(`Are you sure you want to delete speaker ${speakerId}?`)) return;
try {
await deleteSpeaker(speakerId);
alert(`Speaker ${speakerId} deleted successfully.`);
loadSpeakers(); // Refresh speaker list
} catch (error) {
console.error(`Failed to delete speaker ${speakerId}:`, error);
alert(`Error deleting speaker: ${error.message}`);
}
}
// --- Dialog Editor --- //
let dialogItems = []; // Holds the sequence of speech/silence items
let availableSpeakersCache = []; // To populate speaker dropdown
function initializeDialogEditor() {
const dialogItemsContainer = document.getElementById('dialog-items-container');
const addSpeechLineBtn = document.getElementById('add-speech-line-btn');
const addSilenceLineBtn = document.getElementById('add-silence-line-btn');
const outputBaseNameInput = document.getElementById('output-base-name');
const generateDialogBtn = document.getElementById('generate-dialog-btn');
// Results Display Elements
const generationLogPre = document.getElementById('generation-log-content'); // Corrected ID
const audioPlayer = document.getElementById('concatenated-audio-player'); // Corrected ID
// audioSource will be the audioPlayer itself, no separate element by default in the HTML
const downloadZipLink = document.getElementById('zip-archive-link'); // Corrected ID
const zipArchivePlaceholder = document.getElementById('zip-archive-placeholder');
const resultsDisplaySection = document.getElementById('results-display');
// Function to render the current dialogItems array to the DOM as table rows
function renderDialogItems() {
if (!dialogItemsContainer) return;
dialogItemsContainer.innerHTML = '';
dialogItems.forEach((item, index) => {
const tr = document.createElement('tr');
// Type column
const typeTd = document.createElement('td');
typeTd.textContent = item.type === 'speech' ? 'Speech' : 'Silence';
tr.appendChild(typeTd);
// Speaker column
const speakerTd = document.createElement('td');
if (item.type === 'speech') {
const speaker = availableSpeakersCache.find(s => s.id === item.speaker_id);
speakerTd.textContent = speaker ? speaker.name : 'Unknown Speaker';
} else {
speakerTd.textContent = '—';
}
tr.appendChild(speakerTd);
// Text/Duration column
const textTd = document.createElement('td');
if (item.type === 'speech') {
let txt = item.text.length > 60 ? item.text.substring(0, 57) + '…' : item.text;
textTd.textContent = `"${txt}"`;
} else {
textTd.textContent = `${item.duration}s`;
}
tr.appendChild(textTd);
// Actions column
const actionsTd = document.createElement('td');
actionsTd.classList.add('actions');
const removeBtn = document.createElement('button');
removeBtn.innerHTML = '&times;'; // Unicode multiplication sign (X)
removeBtn.classList.add('remove-dialog-item-btn', 'x-remove-btn');
removeBtn.setAttribute('aria-label', 'Remove dialog line');
removeBtn.title = 'Remove';
removeBtn.onclick = () => {
dialogItems.splice(index, 1);
renderDialogItems();
};
actionsTd.appendChild(removeBtn);
tr.appendChild(actionsTd);
dialogItemsContainer.appendChild(tr);
});
}
const tempInputArea = document.getElementById('temp-input-area');
function clearTempInputArea() {
if (tempInputArea) tempInputArea.innerHTML = '';
}
if (addSpeechLineBtn) {
addSpeechLineBtn.addEventListener('click', async () => {
clearTempInputArea(); // Clear any previous inputs
if (availableSpeakersCache.length === 0) {
try {
availableSpeakersCache = await getSpeakers();
} catch (error) {
alert('Could not load speakers. Please try again.');
console.error('Error fetching speakers for dialog:', error);
return;
}
}
if (availableSpeakersCache.length === 0) {
alert('No speakers available. Please add a speaker first.');
return;
}
const speakerSelectLabel = document.createElement('label');
speakerSelectLabel.textContent = 'Speaker: ';
speakerSelectLabel.htmlFor = 'temp-speaker-select';
const speakerSelect = document.createElement('select');
speakerSelect.id = 'temp-speaker-select';
availableSpeakersCache.forEach(speaker => {
const option = document.createElement('option');
option.value = speaker.id;
option.textContent = speaker.name;
speakerSelect.appendChild(option);
});
const textInputLabel = document.createElement('label');
textInputLabel.textContent = ' Text: ';
textInputLabel.htmlFor = 'temp-speech-text';
const textInput = document.createElement('textarea');
textInput.id = 'temp-speech-text';
textInput.rows = 2;
textInput.placeholder = 'Enter speech text';
const addButton = document.createElement('button');
addButton.textContent = 'Add Speech';
addButton.onclick = () => {
const speakerId = speakerSelect.value;
const text = textInput.value.trim();
if (!speakerId || !text) {
alert('Please select a speaker and enter text.');
return;
}
dialogItems.push({ type: 'speech', speaker_id: speakerId, text: text });
renderDialogItems();
clearTempInputArea();
};
const cancelButton = document.createElement('button');
cancelButton.textContent = 'Cancel';
cancelButton.onclick = clearTempInputArea;
if (tempInputArea) {
tempInputArea.appendChild(speakerSelectLabel);
tempInputArea.appendChild(speakerSelect);
tempInputArea.appendChild(textInputLabel);
tempInputArea.appendChild(textInput);
tempInputArea.appendChild(addButton);
tempInputArea.appendChild(cancelButton);
}
});
}
if (addSilenceLineBtn) {
addSilenceLineBtn.addEventListener('click', () => {
clearTempInputArea(); // Clear any previous inputs
const durationInputLabel = document.createElement('label');
durationInputLabel.textContent = 'Duration (s): ';
durationInputLabel.htmlFor = 'temp-silence-duration';
const durationInput = document.createElement('input');
durationInput.type = 'number';
durationInput.id = 'temp-silence-duration';
durationInput.step = '0.1';
durationInput.min = '0.1';
durationInput.placeholder = 'e.g., 0.5';
const addButton = document.createElement('button');
addButton.textContent = 'Add Silence';
addButton.onclick = () => {
const duration = parseFloat(durationInput.value);
if (isNaN(duration) || duration <= 0) {
alert('Invalid duration. Please enter a positive number.');
return;
}
dialogItems.push({ type: 'silence', duration: duration });
renderDialogItems();
clearTempInputArea();
};
const cancelButton = document.createElement('button');
cancelButton.textContent = 'Cancel';
cancelButton.onclick = clearTempInputArea;
if (tempInputArea) {
tempInputArea.appendChild(durationInputLabel);
tempInputArea.appendChild(durationInput);
tempInputArea.appendChild(addButton);
tempInputArea.appendChild(cancelButton);
}
});
}
if (generateDialogBtn && outputBaseNameInput) {
generateDialogBtn.addEventListener('click', async () => {
const outputBaseName = outputBaseNameInput.value.trim();
if (!outputBaseName) {
alert('Please enter an output base name.');
outputBaseNameInput.focus();
return;
}
if (dialogItems.length === 0) {
alert('Please add at least one speech or silence line to the dialog.');
return;
}
// Clear previous results and show loading/status
if (generationLogPre) generationLogPre.textContent = 'Generating dialog...';
if (audioPlayer) {
audioPlayer.style.display = 'none';
audioPlayer.src = ''; // Clear previous audio source
}
if (downloadZipLink) {
downloadZipLink.style.display = 'none';
downloadZipLink.href = '#';
downloadZipLink.textContent = '';
}
if (zipArchivePlaceholder) zipArchivePlaceholder.style.display = 'block'; // Show placeholder
if (resultsDisplaySection) resultsDisplaySection.style.display = 'block'; // Make sure it's visible
const payload = {
output_base_name: outputBaseName,
dialog_items: dialogItems.map(item => {
// For now, we are not collecting TTS params in the UI for speech items.
// The backend will use defaults. If we add UI for these later, they'd be included here.
if (item.type === 'speech') {
return {
type: item.type,
speaker_id: item.speaker_id,
text: item.text,
// exaggeration: item.exaggeration, // Example for future UI enhancement
// cfg_weight: item.cfg_weight,
// temperature: item.temperature
};
}
return item; // for silence items
})
};
try {
console.log('Generating dialog with payload:', JSON.stringify(payload, null, 2));
const result = await generateDialog(payload);
console.log('Dialog generation successful:', result);
if (generationLogPre) generationLogPre.textContent = result.log || 'No log output.';
if (result.concatenated_audio_url && audioPlayer) { // Check audioPlayer, not audioSource
audioPlayer.src = result.concatenated_audio_url.startsWith('http') ? result.concatenated_audio_url : `${API_BASE_URL_FOR_FILES}${result.concatenated_audio_url}`;
audioPlayer.load(); // Call load() after setting new source
audioPlayer.style.display = 'block';
} else {
if (audioPlayer) audioPlayer.style.display = 'none'; // Ensure it's hidden if no URL
if (generationLogPre) generationLogPre.textContent += '\nNo concatenated audio URL found.';
}
if (result.zip_archive_url && downloadZipLink) {
downloadZipLink.href = result.zip_archive_url.startsWith('http') ? result.zip_archive_url : `${API_BASE_URL_FOR_FILES}${result.zip_archive_url}`;
downloadZipLink.textContent = `Download ${outputBaseName}.zip`;
downloadZipLink.style.display = 'block';
if (zipArchivePlaceholder) zipArchivePlaceholder.style.display = 'none'; // Hide placeholder
} else {
if (downloadZipLink) downloadZipLink.style.display = 'none';
if (zipArchivePlaceholder) zipArchivePlaceholder.style.display = 'block'; // Show placeholder if no link
if (generationLogPre) generationLogPre.textContent += '\nNo ZIP archive URL found.';
}
} catch (error) {
console.error('Dialog generation failed:', error);
if (generationLogPre) generationLogPre.textContent = `Error generating dialog: ${error.message}`;
alert(`Error generating dialog: ${error.message}`);
}
});
}
console.log('Dialog Editor Initialized');
renderDialogItems(); // Initial render (empty)
}
// --- Results Display --- //
function initializeResultsDisplay() {
const generationLogContent = document.getElementById('generation-log-content');
const concatenatedAudioPlayer = document.getElementById('concatenated-audio-player');
const zipArchiveLink = document.getElementById('zip-archive-link');
const zipArchivePlaceholder = document.getElementById('zip-archive-placeholder');
// Functions to update these elements will be called by the generateDialog handler
// e.g., updateLog(message), setAudioSource(url), setZipLink(url)
console.log('Results Display Initialized');
}

196
frontend/tests/api.test.js Normal file
View File

@ -0,0 +1,196 @@
// frontend/tests/api.test.js
// Import the function to test (adjust path if your structure is different)
// We might need to configure Jest or use Babel for ES module syntax if this causes issues.
import { getSpeakers, addSpeaker, deleteSpeaker, generateDialog } from '../js/api.js';
// Mock the global fetch function
global.fetch = jest.fn();
const API_BASE_URL = 'http://localhost:8000/api'; // Centralize for all tests
describe('API Client - getSpeakers', () => {
beforeEach(() => {
// Clear all instances and calls to constructor and all methods:
fetch.mockClear();
});
it('should fetch speakers successfully', async () => {
const mockSpeakers = [{ id: '1', name: 'Speaker 1' }, { id: '2', name: 'Speaker 2' }];
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockSpeakers,
});
const speakers = await getSpeakers();
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/`);
expect(speakers).toEqual(mockSpeakers);
});
it('should throw an error if the network response is not ok', async () => {
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Not Found',
json: async () => ({ detail: 'Speakers not found' }) // Simulate FastAPI error response
});
await expect(getSpeakers()).rejects.toThrow('Failed to fetch speakers: Speakers not found');
expect(fetch).toHaveBeenCalledTimes(1);
});
it('should throw a generic error if parsing error response fails', async () => {
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Internal Server Error',
json: async () => { throw new Error('Failed to parse error JSON'); } // Simulate error during .json()
});
await expect(getSpeakers()).rejects.toThrow('Failed to fetch speakers: Internal Server Error');
expect(fetch).toHaveBeenCalledTimes(1);
});
it('should throw an error if fetch itself fails (network error)', async () => {
fetch.mockRejectedValueOnce(new TypeError('Network failed'));
await expect(getSpeakers()).rejects.toThrow('Network failed'); // This will be the original fetch error
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - addSpeaker', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should add a speaker successfully', async () => {
const mockFormData = new FormData(); // In a real scenario, this would have data
mockFormData.append('name', 'Test Speaker');
// mockFormData.append('audio_sample_file', new File([''], 'sample.wav')); // File creation in Node test needs more setup or a mock
const mockResponse = { id: '3', name: 'Test Speaker', message: 'Speaker added successfully' };
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockResponse,
});
const result = await addSpeaker(mockFormData);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/`, {
method: 'POST',
body: mockFormData,
});
expect(result).toEqual(mockResponse);
});
it('should throw an error if adding a speaker fails', async () => {
const mockFormData = new FormData();
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Bad Request',
json: async () => ({ detail: 'Invalid speaker data' }),
});
await expect(addSpeaker(mockFormData)).rejects.toThrow('Failed to add speaker: Invalid speaker data');
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - deleteSpeaker', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should delete a speaker successfully with JSON response', async () => {
const speakerId = 'test-speaker-id-123';
const mockResponse = { message: `Speaker ${speakerId} deleted successfully` };
fetch.mockResolvedValueOnce({
ok: true,
status: 200, // Or any 2xx status that might return JSON
json: async () => mockResponse,
});
const result = await deleteSpeaker(speakerId);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/${speakerId}/`, {
method: 'DELETE',
});
expect(result).toEqual(mockResponse);
});
it('should handle successful deletion with 204 No Content response', async () => {
const speakerId = 'test-speaker-id-204';
fetch.mockResolvedValueOnce({
ok: true,
status: 204,
statusText: 'No Content',
// .json() is not called by the function if status is 204
});
const result = await deleteSpeaker(speakerId);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/${speakerId}/`, {
method: 'DELETE',
});
expect(result).toEqual({ message: `Speaker ${speakerId} deleted successfully.` });
});
it('should throw an error if deleting a speaker fails (e.g., speaker not found)', async () => {
const speakerId = 'non-existent-speaker-id';
fetch.mockResolvedValueOnce({
ok: false,
status: 404,
statusText: 'Not Found',
json: async () => ({ detail: 'Speaker not found' }),
});
await expect(deleteSpeaker(speakerId)).rejects.toThrow(`Failed to delete speaker ${speakerId}: Speaker not found`);
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - generateDialog', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should generate dialog successfully', async () => {
const mockPayload = {
output_base_name: "test_dialog",
dialog_items: [
{ type: "speech", speaker_id: "spk_1", text: "Hello.", exaggeration: 1.0, cfg_weight: 3.0, temperature: 0.5 },
{ type: "silence", duration_ms: 250 }
]
};
const mockResponse = {
log: "Dialog generated.",
concatenated_audio_url: "/audio/test_dialog_concatenated.wav",
zip_archive_url: "/audio/test_dialog.zip"
};
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockResponse,
});
const result = await generateDialog(mockPayload);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/dialog/generate/`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(mockPayload),
});
expect(result).toEqual(mockResponse);
});
it('should throw an error if dialog generation fails', async () => {
const mockPayload = { output_base_name: "fail_dialog", dialog_items: [] }; // Example invalid payload
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Bad Request',
json: async () => ({ detail: 'Invalid dialog data' }),
});
await expect(generateDialog(mockPayload)).rejects.toThrow('Failed to generate dialog: Invalid dialog data');
expect(fetch).toHaveBeenCalledTimes(1);
});
});

5384
package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

23
package.json Normal file
View File

@ -0,0 +1,23 @@
{
"name": "chatterbox-test",
"version": "1.0.0",
"description": "This Gradio application provides a user interface for text-to-speech generation using the Chatterbox TTS model. It supports both single utterance generation and multi-speaker dialog generation with configurable silence gaps.",
"main": "index.js",
"type": "module",
"scripts": {
"test": "jest"
},
"repository": {
"type": "git",
"url": "https://oauth2:78f77aaebb8fa1cd3efbd5b738177c127f7d7d0b@gitea.r8z.us/stwhite/chatterbox-ui.git"
},
"keywords": [],
"author": "",
"license": "ISC",
"devDependencies": {
"@babel/core": "^7.27.4",
"@babel/preset-env": "^7.27.2",
"babel-jest": "^30.0.0-beta.3",
"jest": "^29.7.0"
}
}

View File

@ -0,0 +1,12 @@
831c1dbe-c379-4d9f-868b-9798adc3c05d:
name: Adam
sample_path: speaker_samples/831c1dbe-c379-4d9f-868b-9798adc3c05d.wav
608903c4-b157-46c5-a0ea-4b25eb4b83b6:
name: Denise
sample_path: speaker_samples/608903c4-b157-46c5-a0ea-4b25eb4b83b6.wav
3c93c9df-86dc-4d67-ab55-8104b9301190:
name: Maria
sample_path: speaker_samples/3c93c9df-86dc-4d67-ab55-8104b9301190.wav
fb84ce1c-f32d-4df9-9673-2c64e9603133:
name: Debbie
sample_path: speaker_samples/fb84ce1c-f32d-4df9-9673-2c64e9603133.wav

110
storage_service.py Normal file
View File

@ -0,0 +1,110 @@
"""
Project storage service for saving and loading Chatterbox TTS projects.
"""
import json
import os
import asyncio
from pathlib import Path
from typing import List, Optional
from datetime import datetime
from models import DialogProject, DialogLine
class ProjectStorage:
"""Handles saving and loading projects to/from JSON files."""
def __init__(self, storage_dir: str = "projects"):
self.storage_dir = Path(storage_dir)
self.storage_dir.mkdir(exist_ok=True)
async def save_project(self, project: DialogProject) -> bool:
"""Save a project to a JSON file."""
try:
project_file = self.storage_dir / f"{project.id}.json"
# Convert to dict and ensure timestamps are strings
project_data = project.dict()
project_data["last_modified"] = datetime.now().isoformat()
# Ensure created_at is set if not already
if not project_data.get("created_at"):
project_data["created_at"] = datetime.now().isoformat()
with open(project_file, 'w', encoding='utf-8') as f:
json.dump(project_data, f, indent=2, ensure_ascii=False)
return True
except Exception as e:
print(f"Error saving project {project.id}: {e}")
return False
async def load_project(self, project_id: str) -> Optional[DialogProject]:
"""Load a project from a JSON file."""
try:
project_file = self.storage_dir / f"{project_id}.json"
if not project_file.exists():
return None
with open(project_file, 'r', encoding='utf-8') as f:
project_data = json.load(f)
# Validate that audio files still exist
for line in project_data.get("lines", []):
if line.get("audio_url"):
audio_path = Path("dialog_output") / line["audio_url"].split("/")[-1]
if not audio_path.exists():
line["audio_url"] = None
line["status"] = "pending"
return DialogProject(**project_data)
except Exception as e:
print(f"Error loading project {project_id}: {e}")
return None
async def list_projects(self) -> List[dict]:
"""List all saved projects with metadata."""
projects = []
for project_file in self.storage_dir.glob("*.json"):
try:
with open(project_file, 'r', encoding='utf-8') as f:
project_data = json.load(f)
projects.append({
"id": project_data["id"],
"name": project_data["name"],
"created_at": project_data.get("created_at"),
"last_modified": project_data.get("last_modified"),
"line_count": len(project_data.get("lines", [])),
"has_audio": any(line.get("audio_url") for line in project_data.get("lines", []))
})
except Exception as e:
print(f"Error reading project file {project_file}: {e}")
continue
# Sort by last modified (most recent first)
projects.sort(key=lambda x: x.get("last_modified", ""), reverse=True)
return projects
async def delete_project(self, project_id: str) -> bool:
"""Delete a saved project."""
try:
project_file = self.storage_dir / f"{project_id}.json"
if project_file.exists():
project_file.unlink()
return True
return False
except Exception as e:
print(f"Error deleting project {project_id}: {e}")
return False
async def project_exists(self, project_id: str) -> bool:
"""Check if a project exists in storage."""
project_file = self.storage_dir / f"{project_id}.json"
return project_file.exists()
# Global storage instance
project_storage = ProjectStorage()
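# Example usage (for testing, not part of the service itself). This is an added sketch:
# it assumes DialogProject can be constructed from just `id`, `name`, and `lines`
# (matching the keys read back in load_project/list_projects); adjust if the Pydantic
# model requires additional fields.
if __name__ == "__main__":
    async def _demo():
        demo = DialogProject(id="demo-project", name="Demo project", lines=[])
        print("Saved:", await project_storage.save_project(demo))
        print("Projects on disk:", await project_storage.list_projects())
        print("Loaded:", await project_storage.load_project("demo-project"))
        print("Deleted:", await project_storage.delete_project("demo-project"))
    asyncio.run(_demo())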