Compare commits

...

35 Commits

Author SHA1 Message Date
Steve White 733c9d1b5f Merge pull request 'feat/frontend-phase1' (#1) from feat/frontend-phase1 into main
Reviewed-on: #1
2025-08-14 15:44:24 +00:00
Steve White 9c605cd3a0 docs: update README with Windows setup and Paste Script instructions 2025-08-14 10:42:40 -05:00
Steve White d3ac8bf4eb Added windows setup script 2025-08-14 10:35:30 -05:00
Steve White 75a2a37252 added back end concurrency and front end paste feature. 2025-08-14 10:33:44 -05:00
Steve White b28a9bcf58 fixed some UI issues. 2025-08-14 08:11:16 -05:00
Steve White 4f47d69aaa fixed some UI problems and added a clear dialog button. 2025-08-13 18:10:02 -05:00
Steve White f095bb14e5 Fixed buttons: play/pause, stop, settings 2025-08-13 00:43:43 -05:00
Steve White 93e0407eac frontend: add per-line play/pause/stop controls
- Toggle play/pause on same button, add stop button
- Maintain shared audio state to prevent overlap and update button states accordingly
2025-08-13 00:28:30 -05:00
Steve White c9593fe6cc frontend: prevent overlapping per-line playback; backend: print idle eviction settings on startup
- app.js: add shared Audio state, disable play button while playing, stop previous line when new one plays
- start_server.py: print eviction enabled/timeout/check interval
- app/main.py: log eviction settings during FastAPI startup
2025-08-12 17:37:32 -05:00
Steve White cbc164c7a3 backend: implement idle TTS model eviction
- Add MODEL_EVICTION_ENABLED, MODEL_IDLE_TIMEOUT_SECONDS, MODEL_IDLE_CHECK_INTERVAL_SECONDS in app/config.py
- Add ModelManager service to manage TTSService load/unload with usage tracking
- Add background idle reaper in app/main.py (startup/shutdown hooks)
- Refactor dialog router to use ModelManager dependency instead of per-request load/unload
2025-08-12 16:33:54 -05:00
Steve White 41f95cdee3 feat(frontend): inline notifications and loading states
- Add .notice styles and variants in frontend/css/style.css
- Add showNotice, hideNotice, confirmAction in frontend/js/app.js
- Replace all alert and confirm with inline notices
- Add loading states to Add Speaker and Generate Dialog
- Verified container IDs in index.html, grep clean, tests passing
2025-08-12 15:46:23 -05:00
Steve White b62eb0211f feat(frontend): Phase 1 – normalize speakers endpoints, fix API docs and JSON parsing, consolidate state in app.js, tweak CSS border color, align jest/babel-jest + add jest.config.cjs, add dev scripts, sanitize repo URL 2025-08-12 12:16:23 -05:00
Steve White 948712bb3f current working version using chatterbox. 2025-08-12 11:31:00 -05:00
Steve White aeb0f7b638 Update README and add new features
- Updated README.md with comprehensive documentation for new multi-interface architecture
- Added cbx-audiobook.py for long-form audiobook generation
- Added import_helper.py utility for dependency management
- Enhanced backend services for dialog processing, speaker management, and TTS
- Updated CLI tools with improved functionality
- Added OpenCode.md and sample files for development

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-24 15:37:02 -05:00
Steve White 2af705ca43 updated with startup script 2025-06-17 16:26:55 -05:00
Steve White 758aa02053 Patched up to work on M3 laptop. Need to fix the location-specific configuration. 2025-06-07 16:06:38 -05:00
Steve White c91a9598b1 Add API reference 2025-06-06 23:18:47 -05:00
Steve White b37aa56fa6 Added persistence of TTS settings in dialog save/restore. 2025-06-06 11:58:48 -05:00
Steve White f9e952286d Added settings to allow control of exaggeration, cfg_weight, and temperature on each line. 2025-06-06 11:53:43 -05:00
Steve White 26f1d98b46 Fixed Play button to match other icons 2025-06-06 11:35:39 -05:00
Steve White e11a4a091c variablized colors in the .css, tweaked them. Re-arranged buttons. 2025-06-06 11:33:54 -05:00
Steve White 252f885b5a Updated buttons, save/load 2025-06-06 10:36:06 -05:00
Steve White d8eb2492d7 Add dialog script save/load functionality and CLAUDE.md
- Implement save/load buttons in dialog editor interface
- Add JSONL export/import for dialog scripts with validation
- Include timestamp-based filenames for saved scripts
- Add comprehensive error handling and user confirmations
- Create CLAUDE.md with development guidance and architecture overview

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-06 10:05:58 -05:00
Steve White d3ff6e5241 Now uses pre-generated files in the concatenated file. 2025-06-06 09:49:04 -05:00
Steve White 4a7c1ea6a1 Added per-line generation and playback; currently regenerates when you hit 'Generate Audio' 2025-06-06 08:44:21 -05:00
Steve White 0261b86ad2 added single line generation to the backend 2025-06-06 08:26:15 -05:00
Steve White 6ccdd18463 Made rows re-orderable 2025-06-06 00:10:36 -05:00
Steve White 1575bf4292 Made speakers drop-down selectable. 2025-06-06 00:05:05 -05:00
Steve White f2f907452b Miscellaneous visual changes 2025-06-06 00:02:00 -05:00
Steve White 9e4fb35800 Working dialog generator 2025-06-05 18:46:09 -05:00
Steve White 6adcadded1 chore: remove node_modules from git tracking and add to .gitignore 2025-06-05 17:40:27 -05:00
Steve White 4a294608b1 Working layout. 2025-06-05 17:38:12 -05:00
Steve White b5db7172cf Working minimum interface for js and api 2025-06-05 16:47:47 -05:00
Steve White 9d1dc330ea Update docs in .note 2025-06-05 09:22:54 -05:00
Steve White b781d8abcf Updated note directory; gradio interface working. 2025-06-05 09:20:19 -05:00
70 changed files with 13429 additions and 49 deletions

.env.example (new file, +27 lines)

@@ -0,0 +1,27 @@
# Chatterbox TTS Application Configuration
# Copy this file to .env and adjust values for your environment
# Project paths (adjust these for your system)
PROJECT_ROOT=/path/to/your/chatterbox-ui
SPEAKER_SAMPLES_DIR=${PROJECT_ROOT}/speaker_data/speaker_samples
TTS_TEMP_OUTPUT_DIR=${PROJECT_ROOT}/tts_temp_outputs
DIALOG_GENERATED_DIR=${PROJECT_ROOT}/backend/tts_generated_dialogs
# Backend server configuration
BACKEND_HOST=0.0.0.0
BACKEND_PORT=8000
BACKEND_RELOAD=true
# Frontend development server configuration
FRONTEND_HOST=127.0.0.1
FRONTEND_PORT=8001
# API URLs (usually derived from backend configuration)
API_BASE_URL=http://localhost:8000
API_BASE_URL_WITH_PREFIX=http://localhost:8000/api
# CORS configuration (comma-separated list)
CORS_ORIGINS=http://localhost:8001,http://127.0.0.1:8001,http://localhost:3000,http://127.0.0.1:3000
# Device configuration for TTS model (auto, cpu, cuda, mps)
DEVICE=auto

.gitignore (vendored, +18 lines)

@@ -5,3 +5,21 @@ output*.wav
*.mp3
dialog_output/
*.zip
.DS_Store
__pycache__
projects/
# Environment files
.env
.env.local
.env.*.local
backend/.env
frontend/.env
# Generated directories
tts_temp_outputs/
backend/tts_generated_dialogs/
# Node.js dependencies
node_modules/
.aider*

.note/code_structure.md (new file, +32 lines)

@@ -0,0 +1,32 @@
# Code Structure
*(This document will describe the organization of the codebase as it evolves.)*
## Current (Gradio-based - to be migrated)
- `gradio_app.py`: Main application logic for the Gradio UI.
- `requirements.txt`: Python dependencies.
- `speaker_samples/`: Directory for speaker audio samples.
- `speakers.yaml`: Configuration for speakers.
- `single_output/`: Output directory for single utterance TTS.
- `dialog_output/`: Output directory for dialog TTS.
## Planned (FastAPI + Vanilla JS)
### Backend (FastAPI - Python)
- `main.py`: FastAPI application entry point, router setup.
- `api/`: Directory for API endpoint modules (e.g., `tts_routes.py`, `speaker_routes.py`).
- `core/`: Core logic (e.g., TTS processing, dialog assembly, file management).
- `models/`: Pydantic models for request/response validation.
- `services/`: Business logic services (e.g., `TTSService`, `DialogService`).
- `static/` (or served via CDN): For frontend files if not using a separate frontend server during development.
### Frontend (Vanilla JavaScript)
- `index.html`: Main HTML file.
- `css/`: Stylesheets.
- `style.css`
- `js/`: JavaScript files.
- `app.js`: Main application logic.
- `api.js`: Functions for interacting with the FastAPI backend.
- `uiComponents.js`: Reusable UI components (e.g., DialogLine, AudioPlayer).
- `state.js`: Frontend state management (if needed).
- `assets/`: Static assets like images or icons.

.note/concurrency_plan.md (new file, +188 lines)

@@ -0,0 +1,188 @@
# Chatterbox TTS Backend: Bounded Concurrency + File I/O Offload Plan
Date: 2025-08-14
Owner: Backend
Status: Proposed (ready to implement)
## Goals
- Increase GPU utilization and reduce wall-clock time for dialog generation.
- Keep model lifecycle stable (leveraging current `ModelManager`).
- Minimal-risk changes: no API shape changes to clients.
## Scope
- Implement bounded concurrency for per-line speech chunk generation within a single dialog request.
- Offload audio file writes to threads to overlap GPU compute and disk I/O.
- Add configuration knobs to tune concurrency.
## Current State (References)
- `backend/app/services/dialog_processor_service.py`
- `DialogProcessorService.process_dialog()` iterates items and awaits `tts_service.generate_speech(...)` sequentially (lines ~171–201).
- `backend/app/services/tts_service.py`
- `TTSService.generate_speech()` runs the TTS forward and calls `torchaudio.save(...)` on the event loop thread (blocking).
- `backend/app/services/model_manager.py`
- `ModelManager.using()` tracks active work; prevents idle eviction during requests.
- `backend/app/routers/dialog.py`
- `process_dialog_flow()` expects ordered `segment_files` and then concatenates; good to keep order stable.
## Design Overview
1) Bounded concurrency at dialog level
- Plan all output segments with a stable `segment_idx` (including speech chunks, silence, and reused audio).
- For speech chunks, schedule concurrent async tasks with a global semaphore set by config `TTS_MAX_CONCURRENCY` (start at 3–4).
- Await all tasks and collate results by `segment_idx` to preserve order.
2) File I/O offload
- Replace direct `torchaudio.save(...)` with `await asyncio.to_thread(torchaudio.save, ...)` in `TTSService.generate_speech()`.
- This lets the next GPU forward start while previous file writes happen on worker threads.
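To make the pattern concrete, here is a minimal, self-contained sketch of semaphore-bounded scheduling with order-preserving collation; `generate()` is a placeholder for `tts_service.generate_speech(...)`, and the names are illustrative, not the actual service API:
```python
import asyncio

TTS_MAX_CONCURRENCY = 3  # mirrors the config default proposed below

async def generate(idx: int, text: str) -> str:
    """Placeholder for the real GPU forward + file save; returns an output path."""
    await asyncio.sleep(0.1)
    return f"segment_{idx}.wav"

async def process_dialog(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(TTS_MAX_CONCURRENCY)

    async def run_one(idx: int, text: str) -> tuple[int, str]:
        async with sem:  # at most TTS_MAX_CONCURRENCY forwards in flight
            return idx, await generate(idx, text)

    tasks = [asyncio.create_task(run_one(i, c)) for i, c in enumerate(chunks)]
    results = dict(await asyncio.gather(*tasks))
    # collate by segment index so output order matches the sequential baseline
    return [results[i] for i in range(len(chunks))]

if __name__ == "__main__":
    print(asyncio.run(process_dialog(["Hello.", "How are you?", "Goodbye."])))
```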
## Configuration
Add to `backend/app/config.py`:
- `TTS_MAX_CONCURRENCY: int` (default: `int(os.getenv("TTS_MAX_CONCURRENCY", "3"))`).
- Optional (future): `TTS_ENABLE_AMP_ON_CUDA: bool = True` to allow mixed precision on CUDA only.
## Implementation Steps
### A. Dialog-level concurrency
- File: `backend/app/services/dialog_processor_service.py`
- Function: `DialogProcessorService.process_dialog()`
1. Planning pass to assign indices
- Iterate `dialog_items` and build a list of `planned_segments` entries:
- For silence or reuse: immediately append a final result with assigned `segment_idx` and continue.
- For speech: split into `text_chunks`; for each chunk create a planned entry: `{ segment_idx, type: 'speech', speaker_id, text_chunk, abs_speaker_sample_path, filename_base, tts_params }`.
- Increment `segment_idx` for every planned segment (speech chunk or silence/reuse) to preserve final order.
2. Concurrency setup
- Create `sem = asyncio.Semaphore(config.TTS_MAX_CONCURRENCY)`.
- For each planned speech segment, create a task with an inner wrapper:
```python
async def run_one(planned):
    async with sem:
        try:
            out_path = await self.tts_service.generate_speech(
                text=planned["text_chunk"],
                speaker_sample_path=planned["abs_speaker_sample_path"],
                output_filename_base=planned["filename_base"],
                output_dir=dialog_temp_dir,
                exaggeration=planned["tts_params"]["exaggeration"],
                cfg_weight=planned["tts_params"]["cfg_weight"],
                temperature=planned["tts_params"]["temperature"],
            )
            return planned["segment_idx"], {"type": "speech", "path": str(out_path), "speaker_id": planned["speaker_id"], "text_chunk": planned["text_chunk"]}
        except Exception as e:
            return planned["segment_idx"], {"type": "error", "message": f"Error generating speech: {e}", "text_chunk": planned["text_chunk"]}
```
- Schedule with `asyncio.create_task(run_one(p))` and collect tasks.
3. Await and collate
- `results_map = {}`; for each completed task, set `results_map[idx] = payload`.
- Merge: start with all previously final (silence/reuse/error) entries placed by `segment_idx`, then fill speech results by `segment_idx` into a single `segment_results` list sorted ascending by index.
- Keep `processing_log` entries for each planned segment (queued, started, finished, errors).
4. Return value unchanged
- Return `{"log": ..., "segment_files": segment_results, "temp_dir": str(dialog_temp_dir)}`. This maintains router and concatenator behavior.
### B. Offload audio writes
- File: `backend/app/services/tts_service.py`
- Function: `TTSService.generate_speech()`
1. After obtaining `wav` tensor, replace:
```python
# torchaudio.save(str(output_file_path), wav, self.model.sr)
```
with:
```python
await asyncio.to_thread(torchaudio.save, str(output_file_path), wav, self.model.sr)
```
- Keep the rest of cleanup logic (delete `wav`, `gc.collect()`, cache emptying) unchanged.
2. Optional (CUDA-only AMP)
- If CUDA is used and `config.TTS_ENABLE_AMP_ON_CUDA` is True, wrap forward with AMP:
```python
with torch.cuda.amp.autocast(dtype=torch.float16):
    wav = self.model.generate(...)
```
- Leave MPS/CPU code path as-is.
## Error Handling & Ordering
- Every planned segment owns a unique `segment_idx`.
- On failure, insert an error record at that index; downstream concatenation will skip missing/nonexistent paths already.
- Preserve exact output order expected by `routers/dialog.py::process_dialog_flow()`.
## Performance Expectations
- GPU util should increase from ~50% to 75–90% depending on dialog size and line lengths.
- Wall-clock reduction is workload-dependent; target 1.5–2.5x on multi-line dialogs.
## Metrics & Instrumentation
- Add timestamped log entries per segment: planned→queued→started→saved.
- Log effective concurrency (max in-flight), and cumulative GPU time if available.
- Optionally add a simple timing summary at end of `process_dialog()`.
## Testing Plan
1. Unit-ish
- Small dialog (3 speech lines, 1 silence). Ensure ordering is stable and files exist.
- Introduce an invalid speaker to verify error propagation doesn't break the rest.
2. Integration
- POST `/api/dialog/generate` with 20–50 mixed-length lines and a couple of silences.
- Validate: response OK, concatenated file exists, zip contains all generated speech segments, order preserved.
- Compare runtime vs. sequential baseline (before/after); a timing sketch follows this list.
3. Stress/limits
- Long lines split into many chunks; verify no OOM with `TTS_MAX_CONCURRENCY`=3.
- Try `TTS_MAX_CONCURRENCY`=1 to simulate sequential; compare metrics.
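As a rough timing harness for the integration comparison above (a sketch; the host, speaker ID, and reliance on server-side defaults for the per-line TTS parameters are assumptions):
```python
import time
import requests

# 30 speech lines plus one silence; payload shape follows API_REFERENCE.md.
payload = {
    "output_base_name": "bench",
    "dialog_items": (
        [{"type": "speech", "speaker_id": "speaker_001", "text": f"Line {i}."}
         for i in range(30)]
        + [{"type": "silence", "duration": 0.5}]
    ),
}

start = time.time()
resp = requests.post("http://127.0.0.1:8000/api/dialog/generate", json=payload, timeout=3600)
resp.raise_for_status()
print(f"elapsed: {time.time() - start:.1f}s")
print(resp.json()["concatenated_audio_url"])
```
Run it once with `TTS_MAX_CONCURRENCY=1` and once with the default to compare wall-clock times.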
## Rollout & Config Defaults
- Default `TTS_MAX_CONCURRENCY=3`.
- Expose via environment variable; no client changes needed.
- If instability observed, set `TTS_MAX_CONCURRENCY=1` to revert to sequential behavior quickly.
## Risks & Mitigations
- OOM under high concurrency → Mitigate with low default, easy rollback, and chunking already in place.
- Disk I/O saturation → Offload to threads; if disk is a bottleneck, decrease concurrency.
- Model thread safety → We call `model.generate` concurrently only up to semaphore cap; if underlying library is not thread-safe for forward passes, consider serializing forwards but still overlapping with file I/O; early logs will reveal.
## Follow-up (Out of Scope for this change)
- Dynamic batching queue inside `TTSService` for further GPU efficiency.
- CUDA AMP enablement and profiling.
- Per-speaker sub-queues if batching requires same-speaker inputs.
## Acceptance Criteria
- `TTS_MAX_CONCURRENCY` is configurable; default=3.
- File writes occur via `asyncio.to_thread`.
- Order of `segment_files` unchanged relative to sequential output.
- End-to-end works for both small and large dialogs; error cases logged.
- Observed GPU utilization and runtime improve on representative dialog.

.note/current_focus.md (new file, +23 lines)

@@ -0,0 +1,23 @@
# Chatterbox TTS Migration: Backend Development (FastAPI)
**Primary Goal:** Implement the FastAPI backend for TTS dialog generation.
**Recent Accomplishments (Phase 1, Step 2 - Speaker Management):**
- Created Pydantic models for speaker data (`speaker_models.py`).
- Implemented `SpeakerManagementService` (`speaker_service.py`) for CRUD operations on speakers (metadata in `speakers.yaml`, samples in `speaker_samples/`).
- Created FastAPI router (`routers/speakers.py`) with endpoints: `GET /api/speakers`, `POST /api/speakers`, `GET /api/speakers/{id}`, `DELETE /api/speakers/{id}`.
- Integrated speaker router into the main FastAPI app (`main.py`).
- Successfully tested all speaker API endpoints using `curl`.
**Current Task (Phase 1, Step 3 - TTS Core):**
- **Develop `TTSService` in `backend/app/services/tts_service.py`.**
- Focus on `ChatterboxTTS` model loading, inference, and critical memory management.
- Define methods for speech generation using speaker samples.
- Manage TTS parameters (exaggeration, cfg_weight, temperature).
**Next Immediate Steps:**
1. Finalize and test the initial implementation of `TTSService`.
2. Proceed to Phase 1, Step 4: Dialog Processing - Implement `DialogProcessorService` including text splitting logic.

.note/decision_log.md (new file, +22 lines)

@@ -0,0 +1,22 @@
# Decision Log
This log records key decisions made throughout the project, along with their rationale.
---
**Date:** 2025-06-05
**Decision ID:** 20250605-001
**Decision:** Adopt the `.note/` Memory Bank system for project documentation and context management.
**Rationale:** As per user's global development standards (MEMORY[user_global]) to ensure persistent knowledge and effective collaboration, especially given potential agent memory resets.
**Impact:** Creation of standard `.note/` files (`project_overview.md`, `current_focus.md`, etc.). All significant project information, decisions, and progress will be logged here.
---
**Date:** 2025-06-05
**Decision ID:** 20250605-002
**Decision:** Created a detailed migration plan for moving from Gradio to FastAPI & Vanilla JS.
**Rationale:** Based on a thorough review of `gradio_app.py` and the user's request, a detailed, phased plan was necessary to guide development. This incorporates key findings about TTS model management, text processing, and output requirements.
**Impact:** The plan is stored in `.note/detailed_migration_plan.md`. `current_focus.md` has been updated to reflect this. Development will follow this plan upon user approval.
**Related Memory:** MEMORY[b82cdf38-f0b9-45cd-8097-5b1b47030a40] (System memory of the plan)
---

.note/detailed_migration_plan.md (new file, +98 lines)

@@ -0,0 +1,98 @@
# Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan
This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).
## 1. Backend (FastAPI) Development
### Objective
Create a robust API to handle TTS generation, speaker management, and file delivery.
### Key Modules/Components
* **API Endpoints:**
* `POST /api/dialog/generate`:
* **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
* **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
* `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
* `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
* `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
* **Core Logic & Services:**
* `TTSService`:
* Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
* Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
* Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
* `DialogProcessorService`:
* Orchestrates dialog generation using `TTSService`.
* Implements `split_text_at_sentence_boundaries` logic for long text inputs.
* Manages generation of individual audio segments.
* `AudioManipulationService`:
* Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences (see the sketch after this component list).
* Creates ZIP archives of all generated audio files using `zipfile`.
* `SpeakerManagementService`:
* Manages `speakers.yaml` (or alternative storage) for speaker metadata.
* Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
* **File Handling:**
* Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).
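A hypothetical sketch of the concatenation step in `AudioManipulationService` (assumes mono waveform tensors at a shared sample rate; the real service will also need resampling and channel handling):
```python
import torch
import torchaudio

def concatenate(segments, sample_rate: int, out_path: str):
    """segments: list of ("speech", tensor) or ("silence", seconds) items,
    where speech tensors have shape (1, num_samples)."""
    parts = []
    for kind, value in segments:
        if kind == "silence":
            # a silence item becomes a zero tensor of the requested duration
            parts.append(torch.zeros(1, int(value * sample_rate)))
        else:
            parts.append(value)
    torchaudio.save(out_path, torch.cat(parts, dim=1), sample_rate)
```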
### Implementation Steps (Phase 1)
1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting.
5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services.
7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
## 2. Frontend (Vanilla JavaScript) Development
### Objective
Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.
### Key Modules/Components
* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
* **CSS (`style.css`):** Styling for a clean and usable interface.
* **JavaScript (`app.js`, `api.js`, `ui.js`):**
* `api.js`: Functions for all backend API communications (`fetch`).
* `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
* `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).
### Implementation Steps (Phase 2)
1. **Basic Layout:** Create `index.html` and `style.css`.
2. **API Client:** Develop `api.js` to interface with all backend endpoints.
3. **Speaker UI:**
* Fetch and display speakers using `ui.js` and `api.js`.
* Implement forms and logic for adding (with file upload) and removing speakers.
4. **Dialog Editor UI:**
* Dynamically add/remove/reorder dialog lines (speech/silence).
* Inputs for speaker selection (populated from API), text, and silence duration.
* Input for `output_base_name`.
5. **Interaction & Results:**
* "Generate Dialog" button to submit data via `api.js`.
* Display generation log, audio player for concatenated output, and download link for ZIP file.
## 3. Integration & Testing (Phase 3)
1. **Full System Connection:** Ensure seamless frontend-backend communication.
2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
4. **UX Refinement:** Iterate on UI/UX based on testing feedback.
## 4. Advanced Features & Deployment (Phase 4)
* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
* **Real-time Updates:** Consider WebSockets for live progress during generation.
* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.
## Key Considerations from `gradio_app.py` Analysis
* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated (a sketch follows this list).
* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
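A plausible reconstruction of the chunking logic (the exact boundary rules in `gradio_app.py` may differ; this greedy sentence-packing version is an assumption):
```python
import re

MAX_CHUNK_CHARS = 300

def split_text_at_sentence_boundaries(text: str, max_len: int = MAX_CHUNK_CHARS):
    """Greedily pack whole sentences into chunks of at most max_len characters.
    A single sentence longer than max_len becomes its own oversized chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```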

.note/development_standards.md (new file, +21 lines)

@@ -0,0 +1,21 @@
# Development Standards
*(To be defined. This document will outline coding conventions, patterns, and best practices for the project.)*
## General Principles
- **Clarity and Readability:** Code should be easy to understand and maintain.
- **Modularity:** Design components with clear responsibilities and interfaces.
- **Testability:** Write code that is easily testable.
## Python (FastAPI Backend)
- Follow PEP 8 style guidelines.
- Use type hints.
- Structure API endpoints logically.
## JavaScript (Vanilla JS Frontend)
- Follow modern JavaScript best practices (ES6+).
- Organize code into modules.
- Prioritize performance and responsiveness.
## Commit Messages
- Follow conventional commit message format (e.g., `feat: add new TTS feature`, `fix: resolve audio playback bug`).

.note/interfaces.md (new file, +88 lines)

@@ -0,0 +1,88 @@
# Component Interfaces
*(This document will define the interfaces between different components of the system, especially between the frontend and backend.)*
## Backend API (FastAPI)
*(To be detailed. Examples below)*
### `/api/tts/generate_single` (POST)
- **Request Body:**
```json
{
"text": "string",
"speaker_id": "string",
"temperature": "float (optional)",
"length_penalty": "float (optional)"
}
```
- **Response Body (Success):**
```json
{
"audio_url": "string (URL to the generated audio file)",
"duration_ms": "integer"
}
```
- **Response Body (Error):**
```json
{
"detail": "string (error message)"
}
```
### `/api/tts/generate_dialog` (POST)
- **Request Body:**
```json
{
"dialog_lines": [
{
"type": "speech", // or "silence"
"speaker_id": "string (required if type is speech)",
"text": "string (required if type is speech)",
"duration_s": "float (required if type is silence)"
}
],
"output_base_name": "string (optional)"
}
```
- **Response Body (Success):**
```json
{
"dialog_audio_url": "string (URL to the concatenated dialog audio file)",
"individual_files_zip_url": "string (URL to zip of individual lines)",
"total_duration_ms": "integer"
}
```
### `/api/speakers` (GET)
- **Response Body (Success):**
```json
[
{
"id": "string",
"name": "string",
"sample_url": "string (optional)"
}
]
```
### `/api/speakers` (POST)
- **Request Body:** (Multipart form-data)
- `name`: "string"
- `audio_sample`: file (WAV)
- **Response Body (Success):**
```json
{
"id": "string",
"name": "string",
"message": "Speaker added successfully"
}
```
## Frontend Components (Vanilla JS)
*(To be detailed as frontend development progresses.)*
- **DialogLine Component:** Manages input for a single line of dialog (speaker, text).
- **AudioPlayer Component:** Handles playback of generated audio.
- **ProjectManager Component:** Manages overall project state, dialog lines, and interaction with the backend.

.note/project_overview.md (new file, +42 lines)

@@ -0,0 +1,42 @@
# Project Overview: Chatterbox TTS Application Migration
## 1. Current System
The project is currently a Gradio-based application named "Chatterbox TTS Gradio App".
Its primary function is to provide a user interface for text-to-speech (TTS) generation using the Chatterbox TTS model.
Key features of the current Gradio application include:
- Single utterance TTS generation.
- Multi-speaker dialog generation with configurable silence gaps.
- Speaker management (adding/removing speakers with custom audio samples).
- Automatic memory optimization (model cleanup after generation).
- Organized output file storage (`single_output/` and `dialog_output/`).
## 2. Project Goal: Migration to Modern Web Stack
The primary goal of this project is to re-implement the Chatterbox TTS application, specifically its dialog generation capabilities, by migrating from the current Gradio framework to a new architecture.
The new architecture will consist of:
- **Frontend**: Vanilla JavaScript
- **Backend**: FastAPI (Python)
This migration aims to address limitations of the Gradio framework, such as audio playback issues, limited UI control, and state management complexity, and to provide a more robust, performant, and professional user experience.
## 3. High-Level Plan & Existing Documentation
A comprehensive implementation plan for this migration already exists and should be consulted. This plan (Memory ID c20c2cce-46d4-453f-9bc3-c18e05dbc66f) outlines:
- A 4-phase implementation (Backend API, Frontend Development, Integration & Testing, Production Features).
- The complete technical architecture.
- A detailed component system (DialogLine, AudioPlayer, ProjectManager).
- Features like real-time status updates and drag-and-drop functionality.
- Migration strategies.
- Expected benefits (e.g., faster responsiveness, better audio reliability).
- An estimated timeline.
## 4. Scope of Current Work
The immediate next step, as requested by the user, is to:
1. Review the existing `gradio_app.py`.
2. Refine or detail the plan for re-implementing the dialog generation functionality with the new stack, leveraging the existing comprehensive plan.
This document will be updated as the project progresses to reflect new decisions, architectural changes, and milestones.

.note/review-20250812.md (new file, +138 lines)

@@ -0,0 +1,138 @@
# Frontend Review and Recommendations
Date: 2025-08-12T11:32:16-05:00
Scope: `frontend/` of `chatterbox-test` monorepo
---
## Summary
- Static vanilla JS frontend served by `frontend/start_dev_server.py` interacting with FastAPI backend under `/api`.
- Solid feature set (speaker management, dialog editor, per-line generation, full dialog generation, save/load) with robust error handling.
- Key issues: inconsistent API trailing slashes, Jest/babel-jest version/config mismatch, minor state duplication, alert/confirm UX, overly dark border color, token in `package.json` repo URL.
---
## Findings
- **Framework/structure**
- `frontend/` is static vanilla JS. Main files:
- `index.html`, `js/app.js`, `js/api.js`, `js/config.js`, `css/style.css`.
- Dev server: `frontend/start_dev_server.py` (CORS, env-based port/host).
- **API client vs backend routes (trailing slashes)**
- Frontend `frontend/js/api.js` currently uses:
- `getSpeakers()`: `${API_BASE_URL}/speakers/` (trailing).
- `addSpeaker()`: `${API_BASE_URL}/speakers/` (trailing).
- `deleteSpeaker()`: `${API_BASE_URL}/speakers/${speakerId}/` (trailing).
- `generateLine()`: `${API_BASE_URL}/dialog/generate_line`.
- `generateDialog()`: `${API_BASE_URL}/dialog/generate`.
- Backend routes:
- `backend/app/routers/speakers.py`: `GET/POST /` and `DELETE /{speaker_id}` (no trailing slash on delete when prefixed under `/api/speakers`).
- `backend/app/routers/dialog.py`: `/generate_line` and `/generate` (match frontend).
- Tests in `frontend/tests/api.test.js` expect no trailing slashes for `/speakers` and `/speakers/{id}`.
- Implication: Inconsistent trailing slashes can cause test failures and possible 404s for delete.
- **Payload schema inconsistencies**
- `generateDialog()` JSDoc shows `silence` as `{ duration_ms: 500 }` but backend expects `duration` (seconds). UI also uses `duration` seconds.
- **Form fields alignment**
- Speaker add uses `name` and `audio_file` which match backend (`Form` and `File`).
- **State management duplication in `frontend/js/app.js`**
- `dialogItems` and `availableSpeakersCache` defined at module scope and again inside `initializeDialogEditor()`, creating shadowing risk. Consolidate to a single source of truth.
- **UX considerations**
- Heavy use of `alert()`/`confirm()`. Prefer inline notifications/banners and per-row error chips (you already render `item.error`).
- Add global loading/disabled states for long actions (e.g., full dialog generation, speaker add/delete).
- **CSS theme issue**
- `--border-light` is `#1b0404` (dark red); semantically a light gray fits better and improves contrast harmony.
- **Testing/Jest/Babel config**
- Root `package.json` uses `jest@^29.7.0` with `babel-jest@^30.0.0-beta.3` (major mismatch). Align versions.
- No `jest.config.cjs` to configure `transform` via `babel-jest` for ESM modules.
- **Security**
- `package.json` `repository.url` embeds a token. Remove secrets from VCS immediately.
- **Dev scripts**
- Only `"test": "jest"` present. Add scripts to run the frontend dev server and test config explicitly.
- **Response handling consistency**
- `generateLine()` parses via `response.text()` then `JSON.parse()`. Others use `response.json()`. Standardize for consistency.
---
## Recommended Actions (Phase 1: Quick wins)
- **Normalize API paths in `frontend/js/api.js`**
- Use no trailing slashes:
- `GET/POST`: `${API_BASE_URL}/speakers`
- `DELETE`: `${API_BASE_URL}/speakers/${speakerId}`
- Keep dialog endpoints unchanged.
- **Fix JSDoc for `generateDialog()`**
- Use `silence: { duration: number }` (seconds), not `duration_ms`.
- **Refactor `frontend/js/app.js` state**
- Remove duplicate `dialogItems`/`availableSpeakersCache` declarations. Choose module-scope or function-scope, and pass references.
- **Improve UX**
- Replace `alert/confirm` with inline banners near `#results-display` and per-row error chips (extend existing `.line-error-msg`).
- Add disabled/loading states for global generate and speaker actions.
- **CSS tweak**
- Set `--border-light: #e5e7eb;` (or similar) to reflect a light border.
- **Harden tests/Jest config**
- Align versions: either Jest 29 + `babel-jest` 29, or upgrade both to 30 stable together.
- Add `jest.config.cjs` with `transform` using `babel-jest` and suitable `testEnvironment`.
- Ensure tests expect normalized API paths (recommended to change code to match tests).
- **Dev scripts**
- Add to root `package.json`:
- `"frontend:dev": "python3 frontend/start_dev_server.py"`
- `"test:frontend": "jest --config ./jest.config.cjs"`
- **Sanitize repository URL**
- Remove embedded token from `package.json`.
- **Standardize response parsing**
- Switch `generateLine()` to `response.json()` unless backend returns `text/plain`.
---
## Backend Endpoint Confirmation
- `speakers` router (`backend/app/routers/speakers.py`):
- List/Create: `GET /`, `POST /` (when mounted under `/api/speakers` → `/api/speakers/`).
- Delete: `DELETE /{speaker_id}` (→ `/api/speakers/{speaker_id}`), no trailing slash.
- `dialog` router (`backend/app/routers/dialog.py`):
- `POST /generate_line`, `POST /generate` (mounted under `/api/dialog`).
---
## Proposed Implementation Plan
- **Phase 1 (1–2 hours)**
- Normalize API paths in `api.js`.
- Fix JSDoc for `generateDialog`.
- Consolidate dialog state in `app.js`.
- Adjust `--border-light` to light gray.
- Add `jest.config.cjs`, align Jest/babel-jest versions.
- Add dev/test scripts.
- Remove token from `package.json`.
- **Phase 2 (2–4 hours)**
- Inline notifications and comprehensive loading/disabled states.
- **Phase 3 (optional)**
- ESLint + Prettier.
- Consider Vite migration (HMR, proxy to backend, improved DX).
---
## Notes
- Current local time captured for this review: 2025-08-12T11:32:16-05:00.
- Frontend config (`frontend/js/config.js`) supports env overrides for API base and dev server port.
- Tests (`frontend/tests/api.test.js`) currently assume endpoints without trailing slashes.

.note/session_log.md (new file, +46 lines)

@@ -0,0 +1,46 @@
# Session Log
---
**Session Start:** 2025-06-05 (Continued)
**Goal:** Progress Phase 1 of Chatterbox TTS backend migration: Initial Project Setup.
**Key Activities & Insights:**
- Created `backend/app/main.py` with a basic FastAPI application instance.
- Confirmed user has an existing `.venv` at the project root.
- Updated `backend/README.md` to reflect usage of the root `.venv` instead of a backend-specific one.
- Adjusted venv activation paths and command execution locations (project root).
- Installed backend dependencies from `backend/requirements.txt` into the root `.venv`.
- Successfully ran the basic FastAPI server using `uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000` from the project root.
- Verified the API is accessible.
- Confirmed all Memory Bank files are present. Reviewed `current_focus.md` and `session_log.md`.
**Next Steps:**
- Update `current_focus.md` and `session_log.md`.
- Proceed to Phase 1, Step 2: Speaker Management.
---
---
**Session Start:** 2025-06-05
**Goal:** Initiate migration of Chatterbox TTS dialog generator from Gradio to Vanilla JS + FastAPI.
**Key Activities & Insights:**
- User requested review of `gradio_app.py` and a plan for re-implementation.
- Checked for `.note/` Memory Bank directory (MEMORY[user_global]).
- Directory not found.
- Read `README.md` to gather project context.
- Created `.note/` directory and populated standard files:
- `project_overview.md` (with initial content based on README and user request).
- `current_focus.md` (outlining immediate tasks).
- `development_standards.md` (template).
- `decision_log.md` (logged decision to use Memory Bank).
- `code_structure.md` (initial thoughts on current and future structure).
- `session_log.md` (this entry).
- `interfaces.md` (template).
**Next Steps:**
- Confirm Memory Bank setup with the user.
- Proceed to review `gradio_app.py`.
---

.note/unload_model_plan.md (new file, +204 lines)

@@ -0,0 +1,204 @@
# Unload Model on Idle: Implementation Plan
## Goals
- Automatically unload large TTS model(s) when idle to reduce RAM/VRAM usage.
- Lazy-load on demand without breaking API semantics.
- Configurable timeout and safety controls.
## Requirements
- Config-driven idle timeout and poll interval.
- Thread-/async-safe across concurrent requests.
- No unload while an inference is in progress.
- Clear logs and metrics for load/unload events.
## Configuration
File: `backend/app/config.py`
- Add:
- `MODEL_IDLE_TIMEOUT_SECONDS: int = 900` (0 disables eviction)
- `MODEL_IDLE_CHECK_INTERVAL_SECONDS: int = 60`
- `MODEL_EVICTION_ENABLED: bool = True`
- Bind to env: `MODEL_IDLE_TIMEOUT_SECONDS`, `MODEL_IDLE_CHECK_INTERVAL_SECONDS`, `MODEL_EVICTION_ENABLED`.
## Design
### ModelManager (Singleton)
File: `backend/app/services/model_manager.py` (new)
- Responsibilities:
- Manage lifecycle (load/unload) of the TTS model/pipeline.
- Provide `get()` that returns a ready model (lazy-load if needed) and updates `last_used`.
- Track active request count to block eviction while > 0.
- Internals:
- `self._model` (or components), `self._last_used: float`, `self._active: int`.
- Locks: `asyncio.Lock` for load/unload; `asyncio.Lock` or `asyncio.Semaphore` for counters.
- Optional CUDA cleanup: `torch.cuda.empty_cache()` after unload.
- API:
- `async def get(self) -> Model`: ensures loaded; bumps `last_used`.
- `async def load(self)`: idempotent; guarded by lock.
- `async def unload(self)`: only when `self._active == 0`; clears refs and caches.
- `def touch(self)`: update `last_used`.
- Context helper: `async def using(self)`: async context manager incrementing/decrementing `active` safely.
### Idle Reaper Task
Registration: FastAPI startup (e.g., in `backend/app/main.py`)
- Background task loop every `MODEL_IDLE_CHECK_INTERVAL_SECONDS`:
- If eviction enabled and timeout > 0 and model is loaded and `active == 0` and `now - last_used >= timeout`, call `unload()`.
- Handle cancellation on shutdown.
### API Integration
- Replace direct model access in endpoints with:
```python
manager = ModelManager.instance()
async with manager.using():
    model = await manager.get()
    # perform inference
```
- Optionally call `manager.touch()` at request start for non-inference paths that still need the model resident.
## Pseudocode
```python
# services/model_manager.py
import time, asyncio
from typing import Optional
from .config import settings

class ModelManager:
    _instance: Optional["ModelManager"] = None

    def __init__(self):
        self._model = None
        self._last_used = time.time()
        self._active = 0
        self._lock = asyncio.Lock()
        self._counter_lock = asyncio.Lock()

    @classmethod
    def instance(cls):
        if not cls._instance:
            cls._instance = cls()
        return cls._instance

    async def load(self):
        async with self._lock:
            if self._model is not None:
                return
            # ... load model/pipeline here (load_pipeline is a stand-in) ...
            self._model = await load_pipeline()
            self._last_used = time.time()

    async def unload(self):
        async with self._lock:
            if self._model is None:
                return
            if self._active > 0:
                return  # safety: do not unload while in use
            # ... free resources ...
            self._model = None
            try:
                import torch
                torch.cuda.empty_cache()
            except Exception:
                pass

    async def get(self):
        if self._model is None:
            await self.load()
        self._last_used = time.time()
        return self._model

    async def _inc(self):
        async with self._counter_lock:
            self._active += 1

    async def _dec(self):
        async with self._counter_lock:
            self._active = max(0, self._active - 1)
            self._last_used = time.time()

    def last_used(self):
        return self._last_used

    def is_loaded(self):
        return self._model is not None

    def active(self):
        return self._active

    def using(self):
        manager = self

        class _Ctx:
            async def __aenter__(self):
                await manager._inc()
                return manager

            async def __aexit__(self, exc_type, exc, tb):
                await manager._dec()

        return _Ctx()

# main.py (startup)
import contextlib, logging
logger = logging.getLogger(__name__)

@app.on_event("startup")
async def start_reaper():
    async def reaper():
        while True:
            try:
                await asyncio.sleep(settings.MODEL_IDLE_CHECK_INTERVAL_SECONDS)
                if not settings.MODEL_EVICTION_ENABLED:
                    continue
                timeout = settings.MODEL_IDLE_TIMEOUT_SECONDS
                if timeout <= 0:
                    continue
                m = ModelManager.instance()
                if m.is_loaded() and m.active() == 0 and (time.time() - m.last_used()) >= timeout:
                    await m.unload()
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.exception("Idle reaper error: %s", e)
    app.state._model_reaper_task = asyncio.create_task(reaper())

@app.on_event("shutdown")
async def stop_reaper():
    task = getattr(app.state, "_model_reaper_task", None)
    if task:
        task.cancel()
        with contextlib.suppress(Exception):
            await task
```
## Observability
- Logs: model load/unload, reaper decisions, active count.
- Metrics (optional): counters and gauges (load events, active, residency time).
## Safety & Edge Cases
- Avoid unload when `active > 0`.
- Guard multiple loads/unloads with lock.
- Multi-worker servers: each worker manages its own model.
- Cold-start latency: document expected additional latency for first request after idle unload.
## Testing
- Unit tests for `ModelManager`: load/unload idempotency, counter behavior.
- Simulated reaper triggering with short timeouts.
- Endpoint tests: concurrency (N simultaneous inferences), ensure no unload mid-flight.
## Rollout Plan
1. Introduce config + Manager (no reaper), switch endpoints to `using()`.
2. Enable reaper with long timeout in staging; observe logs/metrics.
3. Tune timeout; enable in production.
## Tasks Checklist
- [ ] Add config flags and defaults in `backend/app/config.py`.
- [ ] Create `backend/app/services/model_manager.py`.
- [ ] Register startup/shutdown reaper in app init (`backend/app/main.py`).
- [ ] Refactor endpoints to use `ModelManager.instance().using()` and `get()`.
- [ ] Add logs and optional metrics.
- [ ] Add unit/integration tests.
- [ ] Update README/ops docs.
## Alternatives Considered
- Gunicorn/uvicorn worker preloading with external idle supervisor: more complexity, less portability.
- OS-level cgroup memory pressure eviction: opaque and risky for correctness.
## Configuration Examples
```
MODEL_EVICTION_ENABLED=true
MODEL_IDLE_TIMEOUT_SECONDS=900
MODEL_IDLE_CHECK_INTERVAL_SECONDS=60
```

.opencode/init (new file, empty)

.opencode/opencode.db (new binary file, not shown)

.opencode/opencode.db-shm (new binary file, not shown)

.opencode/opencode.db-wal (new binary file, not shown)

AGENTS.md (new file, +50 lines)

@@ -0,0 +1,50 @@
# Agent Guidelines for Chatterbox-UI
## Build/Test Commands
```bash
# Backend (FastAPI)
pip install -r backend/requirements.txt
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
python backend/run_api_test.py # Run all backend tests
# Frontend
npm test # Run all frontend tests
npx jest frontend/tests/api.test.js # Run single test file
# Alternative UI
python gradio_app.py # Run Gradio interface
```
## Code Style Guidelines
### Python
- Use type hints (from typing import Optional, List, etc.)
- Exception handling: Use try/except with specific exceptions
- Async/await for FastAPI endpoints and services
- Docstrings for functions and classes
- Use pathlib.Path for file operations
- Organize code into routers, models, and services
### JavaScript
- ES6 modules with import/export
- JSDoc comments for functions
- Async/await for API calls
- Proper error handling with detailed messages
- Descriptive variable and function names
- Consistent error handling pattern in API calls
### Error Handling
- Backend: Raise specific exceptions, use try/except/finally
- Frontend: Use try/catch with detailed error messages
- Always include error details in API responses
### Naming Conventions
- Python: snake_case for variables/functions, PascalCase for classes
- JavaScript: camelCase for variables/functions
- Descriptive, intention-revealing names
### Architecture Notes
- Backend: FastAPI on port 8000, structured as routers/models/services
- Frontend: Vanilla JS (ES6+) on port 8001, modular design
- API Base URL: http://localhost:8000/api
- Speaker data in YAML format at speaker_data/speakers.yaml

API_REFERENCE.md (new file, +518 lines)

@@ -0,0 +1,518 @@
# Chatterbox TTS API Reference
## Overview
The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.
**Base URL**: `http://127.0.0.1:8000`
**API Version**: 0.1.0
**Framework**: FastAPI with automatic OpenAPI documentation
## Quick Start
- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI)
- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc)
- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json`
## Authentication
Currently, the API does not require authentication. CORS is configured to allow requests from `localhost:8001` and `127.0.0.1:8001`.
---
## Endpoints
### 🏠 Root Endpoint
#### `GET /`
Welcome message and API status check.
**Response:**
```json
{
"message": "Welcome to the Chatterbox TTS API!"
}
```
---
## 👥 Speaker Management
### `GET /api/speakers/`
Retrieve all available speakers.
**Response Model:** `List[Speaker]`
**Example Response:**
```json
[
{
"id": "speaker_001",
"name": "John Doe",
"sample_path": "/path/to/speaker_samples/john_doe.wav"
},
{
"id": "speaker_002",
"name": "Jane Smith",
"sample_path": "/path/to/speaker_samples/jane_smith.wav"
}
]
```
**Status Codes:**
- `200`: Success
---
### `POST /api/speakers/`
Create a new speaker from an audio sample.
**Request Type:** `multipart/form-data`
**Parameters:**
- `name` (form field, required): Speaker name
- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.)
**Response Model:** `SpeakerResponse`
**Example Response:**
```json
{
"id": "speaker_003",
"name": "Alex Johnson",
"message": "Speaker added successfully."
}
```
**Status Codes:**
- `201`: Speaker created successfully
- `400`: Invalid file type or missing file
- `500`: Server error during speaker creation
**Example cURL:**
```bash
curl -X POST "http://127.0.0.1:8000/api/speakers/" \
-F "name=Alex Johnson" \
-F "audio_file=@/path/to/sample.wav"
```
---
### `GET /api/speakers/{speaker_id}`
Get details for a specific speaker.
**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier
**Response Model:** `Speaker`
**Example Response:**
```json
{
"id": "speaker_001",
"name": "John Doe",
"sample_path": "/path/to/speaker_samples/john_doe.wav"
}
```
**Status Codes:**
- `200`: Success
- `404`: Speaker not found
---
### `DELETE /api/speakers/{speaker_id}`
Delete a speaker by ID.
**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier
**Example Response:**
```json
{
"message": "Speaker deleted successfully"
}
```
**Status Codes:**
- `200`: Speaker deleted successfully
- `404`: Speaker not found
---
## 🎭 Dialog Generation
### `POST /api/dialog/generate_line`
Generate audio for a single dialog line (speech or silence).
**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem`
#### Speech Item Example:
```json
{
"type": "speech",
"speaker_id": "speaker_001",
"text": "Hello, this is a test message.",
"exaggeration": 0.7,
"cfg_weight": 0.6,
"temperature": 0.8,
"use_existing_audio": false,
"audio_url": null
}
```
#### Silence Item Example:
```json
{
"type": "silence",
"duration": 2.0,
"use_existing_audio": false,
"audio_url": null
}
```
**Response:**
```json
{
"audio_url": "/generated_audio/line_abc123def456.wav",
"type": "speech",
"text": "Hello, this is a test message."
}
```
**Status Codes:**
- `200`: Audio generated successfully
- `400`: Invalid request format or unknown dialog item type
- `404`: Speaker not found
- `500`: Server error during generation
---
### `POST /api/dialog/generate`
Generate a complete dialog from multiple speech and silence items.
**Request Model:** `DialogRequest`
**Request Body:**
```json
{
"dialog_items": [
{
"type": "speech",
"speaker_id": "speaker_001",
"text": "Welcome to our podcast!",
"exaggeration": 0.5,
"cfg_weight": 0.5,
"temperature": 0.8
},
{
"type": "silence",
"duration": 1.0
},
{
"type": "speech",
"speaker_id": "speaker_002",
"text": "Thank you for having me!",
"exaggeration": 0.6,
"cfg_weight": 0.7,
"temperature": 0.9
}
],
"output_base_name": "podcast_episode_01"
}
```
**Response Model:** `DialogResponse`
**Example Response:**
```json
{
"log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
"concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
"zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
"temp_dir_path": "/path/to/temp/directory",
"error_message": null
}
```
**Status Codes:**
- `200`: Dialog generated successfully
- `400`: Invalid request format or validation errors
- `404`: Speaker or file not found
- `500`: Server error during generation
---
## 📁 Static File Serving
### `GET /generated_audio/{filename}`
Serve generated audio files and ZIP archives.
**Path Parameters:**
- `filename` (string, required): Name of the generated file
**Response:** Binary audio file or ZIP archive
**Example URLs:**
- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav`
- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip`
---
## 📋 Data Models
### Speaker Models
#### `Speaker`
```json
{
"id": "string",
"name": "string",
"sample_path": "string|null"
}
```
#### `SpeakerResponse`
```json
{
"id": "string",
"name": "string",
"message": "string|null"
}
```
### Dialog Models
#### `SpeechItem`
```json
{
"type": "speech",
"speaker_id": "string",
"text": "string",
"exaggeration": 0.5, // 0.0-2.0, controls expressiveness
"cfg_weight": 0.5, // 0.0-2.0, alignment with speaker characteristics
"temperature": 0.8, // 0.0-2.0, randomness in generation
"use_existing_audio": false,
"audio_url": "string|null"
}
```
#### `SilenceItem`
```json
{
"type": "silence",
"duration": 1.0, // seconds, must be > 0
"use_existing_audio": false,
"audio_url": "string|null"
}
```
#### `DialogRequest`
```json
{
"dialog_items": [
// Array of SpeechItem and/or SilenceItem objects
],
"output_base_name": "string" // Base name for output files
}
```
#### `DialogResponse`
```json
{
"log": "string", // Processing log
"concatenated_audio_url": "string|null", // URL to final audio
"zip_archive_url": "string|null", // URL to ZIP archive
"temp_dir_path": "string|null", // Server temp directory
"error_message": "string|null" // Error details if failed
}
```
---
## 🎛️ TTS Parameters
### Exaggeration (`exaggeration`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.
### CFG Weight (`cfg_weight`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.
### Temperature (`temperature`)
- **Range**: 0.0 - 2.0
- **Default**: 0.8
- **Description**: Controls randomness in generation. Lower values produce more deterministic speech, higher values add more variation.
---
## 🔧 Configuration
### Environment Variables
The API uses the following directory structure (configurable in `app/config.py`):
- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/`
- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/`
- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/`
### CORS Settings
- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001` (plus any `FRONTEND_HOST:FRONTEND_PORT` when using `start_servers.py`)
- Allowed Methods: All
- Allowed Headers: All
- Credentials: Enabled
---
## 🚀 Usage Examples
### Python Client Example
```python
import requests
import json
# Base URL
BASE_URL = "http://127.0.0.1:8000"
# Get all speakers
speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
print("Available speakers:", speakers)
# Generate a simple dialog
dialog_request = {
"dialog_items": [
{
"type": "speech",
"speaker_id": speakers[0]["id"],
"text": "Hello world!",
"exaggeration": 0.7,
"cfg_weight": 0.6,
"temperature": 0.9
},
{
"type": "silence",
"duration": 1.0
}
],
"output_base_name": "test_dialog"
}
response = requests.post(
f"{BASE_URL}/api/dialog/generate",
json=dialog_request
)
if response.status_code == 200:
    result = response.json()
    print("Dialog generated!")
    print("Audio URL:", result["concatenated_audio_url"])
    print("ZIP URL:", result["zip_archive_url"])
else:
    print("Error:", response.text)
```
### JavaScript/Frontend Example
```javascript
// Generate dialog
const dialogRequest = {
dialog_items: [
{
type: "speech",
speaker_id: "speaker_001",
text: "Welcome to our show!",
exaggeration: 0.6,
cfg_weight: 0.5,
temperature: 0.8
}
],
output_base_name: "intro"
};
fetch('http://127.0.0.1:8000/api/dialog/generate', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(dialogRequest)
})
.then(response => response.json())
.then(data => {
console.log('Dialog generated:', data);
// Play the audio
const audio = new Audio(data.concatenated_audio_url);
audio.play();
});
```
---
## ⚠️ Error Handling
### Common Error Responses
#### 400 Bad Request
```json
{
"detail": "Invalid value or configuration: Text cannot be empty"
}
```
#### 404 Not Found
```json
{
"detail": "Speaker sample for ID 'invalid_speaker' not found."
}
```
#### 500 Internal Server Error
```json
{
"detail": "Runtime error during dialog generation: CUDA out of memory"
}
```
### Error Categories
- **Validation Errors**: Invalid input format, missing required fields
- **Resource Errors**: Speaker not found, file not accessible
- **Processing Errors**: TTS model failures, audio processing issues
- **System Errors**: Memory issues, disk space, model loading failures
---
## 🔍 Development & Debugging
### Running the Server
```bash
# From project root
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```
### API Documentation
- **Swagger UI**: `http://127.0.0.1:8000/docs`
- **ReDoc**: `http://127.0.0.1:8000/redoc`
### Logging
The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations.
### File Management
- Generated files are stored in `backend/tts_generated_dialogs/`
- Temporary processing files are kept for inspection (not auto-deleted)
- ZIP archives contain individual audio segments plus concatenated result
---
## 📝 Notes
- The API automatically loads and unloads TTS models to manage memory usage
- Speaker audio samples should be clear, single-speaker recordings for best results
- Large dialogs may take significant time to process depending on hardware
- Generated files are served statically and persist until manually cleaned up
---
*Generated on: 2025-06-06*
*API Version: 0.1.0*

CLAUDE.md (new file)
@ -0,0 +1,165 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Common Development Commands
### Backend (FastAPI)
```bash
# Install backend dependencies (run from project root)
pip install -r backend/requirements.txt
# Run backend development server (run from project root)
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
# Run backend tests
python backend/run_api_test.py
# Backend API accessible at http://127.0.0.1:8000
# API docs at http://127.0.0.1:8000/docs
```
### Frontend Development
```bash
# Install frontend dependencies
npm install
# Run frontend tests
npm test
# Start frontend dev server separately
cd frontend && python start_dev_server.py
```
### Integrated Development Environment
```bash
# Start both backend and frontend servers concurrently
python start_servers.py
# Or alternatively, run the backend startup script from the backend directory
cd backend && python start_server.py
```
### Command-Line TTS Generation
```bash
# Generate single utterance with CLI
python cbx-generate.py --sample speaker_samples/voice.wav --output output.wav --text "Hello world"
# Generate dialog from script
python cbx-dialog-generate.py --dialog dialog.md --output dialog_output
# Generate audiobook from text file
python cbx-audiobook.py --input book.txt --output audiobook --speaker speaker_name
```
### Alternative Interfaces
```bash
# Run Gradio interface (standalone TTS app)
python gradio_app.py
```
## Architecture Overview
This is a full-stack TTS (Text-to-Speech) application with three interfaces:
1. **Modern web frontend** (vanilla JS) - Interactive dialog editor at `frontend/`
2. **FastAPI backend** - REST API at `backend/`
3. **Gradio interface** - Alternative UI in `gradio_app.py`
### Frontend-Backend Communication
- **Frontend**: Vanilla JS (ES6 modules) serving on port 8001
- **Backend**: FastAPI serving on port 8000
- **API Base**: `http://localhost:8000/api`
- **CORS**: Configured for frontend communication
- **File Serving**: Generated audio served via `/generated_audio/` endpoint
### Key API Endpoints
- `/api/speakers/` - Speaker CRUD operations
- `/api/dialog/generate/` - Full dialog generation
- `/api/dialog/generate_line/` - Single line generation
- `/generated_audio/` - Static audio file serving
### Backend Service Architecture
Located in `backend/app/services/`:
- **TTSService**: Chatterbox TTS model lifecycle management
- **SpeakerManagementService**: Speaker data and sample management
- **DialogProcessorService**: Dialog script to audio processing
- **AudioManipulationService**: Audio concatenation and ZIP creation
### Frontend Architecture
- **Modular design**: `api.js` (API layer) + `app.js` (app logic)
- **No framework**: Modern vanilla JavaScript with ES6+ features
- **Interactive editor**: Table-based dialog creation with drag-drop reordering
### Data Flow
1. User creates dialog in frontend table editor
2. Frontend sends dialog items to `/api/dialog/generate/`
3. Backend processes speech/silence items via services
4. TTS generates audio, segments concatenated
5. ZIP archive created with all outputs
6. Frontend receives URLs for playback/download
### Speaker Configuration
- **Location**: `speaker_data/speakers.yaml` and `speaker_data/speaker_samples/`
- **Format**: YAML config referencing WAV audio samples
- **Management**: Both API endpoints and file-based configuration
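For quick inspection, the YAML can be read directly; a minimal sketch (the exact key layout is an assumption, check your `speaker_data/speakers.yaml`):
```python
import yaml

# Assumed layout: a mapping of speaker ID to its configuration
# (e.g. name and sample file), as managed by the speakers API.
with open("speaker_data/speakers.yaml") as f:
    speakers = yaml.safe_load(f)

for speaker_id, info in speakers.items():
    print(speaker_id, "->", info)
```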
### Output Organization
- `dialog_output/` - Generated dialog files
- `single_output/` - Single utterance outputs
- `tts_outputs/` - Raw TTS generation files
- Generated ZIPs contain organized file structure
## Development Setup Notes
- Python virtual environment expected at project root (`.venv`)
- Backend commands run from project root, not `backend/` directory
- Frontend served separately (typically port 8001)
- Speaker samples must be WAV format in `speaker_data/speaker_samples/`
## Environment Configuration
### Quick Setup
```bash
# Run automated setup (creates .env files)
python setup.py
# Install dependencies
pip install -r backend/requirements.txt
npm install
```
### Manual Environment Variables
Key environment variables that can be configured in `.env` files:
- `PROJECT_ROOT`: Base project directory
- `BACKEND_PORT`/`FRONTEND_PORT`: Server ports (default: 8000/8001)
- `DEVICE`: TTS model device (`auto`, `cpu`, `cuda`, `mps`)
- `CORS_ORIGINS`: Allowed frontend origins for CORS
- `SPEAKER_SAMPLES_DIR`: Directory for speaker audio files
### Configuration Files Structure
- `.env`: Global configuration
- `backend/.env`: Backend-specific settings
- `frontend/.env`: Frontend-specific settings
- `speaker_data/speakers.yaml`: Speaker configuration
## CLI Tools Overview
- `cbx-generate.py`: Single utterance generation
- `cbx-dialog-generate.py`: Multi-speaker dialog generation
- `cbx-audiobook.py`: Long-form audiobook generation
- `start_servers.py`: Integrated development server launcher

ENVIRONMENT_SETUP.md (new file)
@ -0,0 +1,153 @@
# Environment Configuration Guide
This guide explains how to configure the Chatterbox TTS application for different environments using environment variables.
## Quick Setup
1. **Run the setup script:**
```bash
python setup.py
```
2. **Install backend dependencies:**
```bash
cd backend
pip install -r requirements.txt
```
3. **Start the servers:**
```bash
# Terminal 1 - Backend
cd backend
python start_server.py
# Terminal 2 - Frontend
cd frontend
python start_dev_server.py
```
4. **Open the application:**
Open http://127.0.0.1:8001 in your browser
## Manual Configuration
### Environment Files
The application uses environment variables for configuration. Three `.env` files control different aspects:
- **Root `.env`**: Global configuration
- **`backend/.env`**: Backend-specific settings
- **`frontend/.env`**: Frontend-specific settings
### Key Configuration Options
#### Paths
- `PROJECT_ROOT`: Base directory for the project
- `SPEAKER_SAMPLES_DIR`: Directory containing speaker audio samples
- `TTS_TEMP_OUTPUT_DIR`: Temporary directory for TTS processing
- `DIALOG_GENERATED_DIR`: Directory for generated dialog audio files
#### Server Configuration
- `HOST`: Backend server host (default: 0.0.0.0)
- `PORT`: Backend server port (default: 8000)
- `RELOAD`: Enable auto-reload for development (default: true)
#### Frontend Configuration
- `VITE_API_BASE_URL`: Backend API base URL
- `VITE_DEV_SERVER_PORT`: Frontend development server port
- `VITE_DEV_SERVER_HOST`: Frontend development server host
#### CORS Configuration
- `CORS_ORIGINS`: Comma-separated list of allowed origins. When using `start_servers.py` with the default `FRONTEND_HOST=0.0.0.0` and no explicit `CORS_ORIGINS`, CORS will allow all origins (wildcard) to simplify development.
#### Device Configuration
- `DEVICE`: Device for TTS model (auto, cpu, cuda, mps)
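With `DEVICE=auto`, the backend selects hardware at startup; a sketch of the typical resolution order (illustrative, the authoritative logic lives in the backend services):
```python
import torch

def resolve_device(setting: str = "auto") -> str:
    """Map the DEVICE setting to a concrete torch device string."""
    if setting != "auto":
        return setting  # explicit "cpu", "cuda", or "mps" is used as-is
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```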
## Example Configurations
### Development Environment
```bash
# .env
PROJECT_ROOT=/Users/yourname/chatterbox-ui
BACKEND_PORT=8000
FRONTEND_PORT=8001
DEVICE=auto
CORS_ORIGINS=http://localhost:8001,http://127.0.0.1:8001
```
### Production Environment
```bash
# .env
PROJECT_ROOT=/opt/chatterbox-ui
BACKEND_HOST=0.0.0.0
BACKEND_PORT=8000
FRONTEND_PORT=3000
DEVICE=cuda
CORS_ORIGINS=https://yourdomain.com
```
### Docker Environment
```bash
# .env
PROJECT_ROOT=/app
BACKEND_HOST=0.0.0.0
BACKEND_PORT=8000
DEVICE=cpu
CORS_ORIGINS=http://localhost:3000
```
## Troubleshooting
### Common Issues
1. **Permission Errors**: Ensure the `PROJECT_ROOT` directory is writable
2. **CORS Errors**: Check that your frontend URL is in `CORS_ORIGINS`. (When using `start_servers.py`, your specified `FRONTEND_HOST:FRONTEND_PORT` is auto-included.)
3. **Model Loading Errors**: Verify `DEVICE` setting matches your hardware
4. **Path Errors**: Ensure all path variables point to existing, accessible directories
### Debugging
Enable debug logging by setting:
```bash
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export DEBUG=1
```
### Resetting Configuration
To reset to defaults:
```bash
rm .env backend/.env frontend/.env
python setup.py
```
## File Structure
```
chatterbox-ui/
├── .env # Global configuration
├── .env.example # Template for global config
├── setup.py # Automated setup script
├── backend/
│ ├── .env # Backend configuration
│ ├── .env.example # Template for backend config
│ ├── start_server.py # Backend startup script
│ └── app/
│ └── config.py # Configuration loader
├── frontend/
│ ├── .env # Frontend configuration
│ ├── .env.example # Template for frontend config
│ ├── start_dev_server.py # Frontend dev server
│ └── js/
│ └── config.js # Frontend configuration loader
└── speaker_data/
└── speaker_samples/ # Speaker audio files
```
## Security Notes
- Never commit `.env` files to version control
- Use strong, unique values for production
- Restrict CORS origins in production
- Use HTTPS in production environments
- Regularly update dependencies

OpenCode.md (new file)
@ -0,0 +1,62 @@
# OpenCode.md - Chatterbox UI Development Guide
## Build & Run Commands
```bash
# Backend (FastAPI)
pip install -r backend/requirements.txt
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
# Frontend
python frontend/start_dev_server.py # Serves on port 8001
# Run backend tests
python backend/run_api_test.py
# Run frontend tests
npm test
# Run specific frontend test
npx jest frontend/tests/api.test.js
# Run Gradio interface
python gradio_app.py
# Run utility scripts
python cbx-audiobook.py --list-speakers # List available speakers
python cbx-audiobook.py sample-audiobook.txt --speaker <speaker_id> # Generate audiobook
python cbx-dialog-generate.py sample-dialog.md # Generate dialog
```
## Code Style Guidelines
### Python
- Use type hints (from typing import Optional, List, Dict, etc.)
- Error handling: Use try/except with specific exceptions
- Async/await for I/O operations
- Docstrings for functions and classes
- PEP 8 naming: snake_case for functions/variables, PascalCase for classes
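A small example illustrating these conventions (a hypothetical helper, not from the codebase):
```python
import asyncio
from pathlib import Path
from typing import Optional

async def read_speaker_sample(sample_path: Path) -> Optional[bytes]:
    """Return the raw bytes of a speaker sample, or None if it is missing."""
    try:
        # Async/await for I/O, with a specific exception rather than a bare except.
        return await asyncio.to_thread(sample_path.read_bytes)
    except FileNotFoundError:
        return None
```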
### JavaScript
- ES6 modules with import/export
- Async/await for API calls
- JSDoc comments for functions
- Error handling: try/catch with detailed error messages
- camelCase naming for variables and functions
## Import Structure
- When importing from scripts, use `import import_helper` first to fix the Python path (see the sketch after this list)
- Backend modules use relative imports within the app package
- Services are in `backend.app.services`
- Models are in `backend.app.models`
- Configuration is in `backend.app.config`
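A minimal sketch of a root-level script following this pattern (assumes `import_helper` puts the project root on `sys.path`):
```python
import import_helper  # must come first so backend.* imports resolve

from backend.app.services.speaker_service import SpeakerManagementService

# List configured speakers via the backend service layer.
service = SpeakerManagementService()
for speaker in service.get_speakers():
    print(speaker.id, speaker.name)
```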
## Project Structure
- Backend: FastAPI with service-oriented architecture
- Frontend: Vanilla JS with modular design (api.js, app.js, config.js)
- Speaker data in YAML format with WAV samples
- Output directories: dialog_output/, single_output/, tts_outputs/
## Common Issues
- Import errors: Make sure to use `import import_helper` in scripts
- Speaker samples must be WAV format in `speaker_data/speaker_samples/`
- TTS model requires GPU (CUDA) or Apple Silicon (MPS)

README.md (modified)
@ -1,73 +1,203 @@
# Chatterbox TTS Gradio App
# Chatterbox TTS Application
This Gradio application provides a user interface for text-to-speech generation using the Chatterbox TTS model. It supports both single utterance generation and multi-speaker dialog generation with configurable silence gaps.
A comprehensive text-to-speech application with multiple interfaces for generating speech from text using the Chatterbox TTS model. Supports single utterance generation, multi-speaker dialogs, and long-form audiobook generation.
## Features
- **Multiple Interfaces**: Web UI, FastAPI backend, Gradio interface, and CLI tools
- **Single Utterance Generation**: Generate speech from text using a selected speaker
- **Dialog Generation**: Create multi-speaker conversations with configurable silence gaps
- **Audiobook Generation**: Convert long-form text into narrated audiobooks
- **Speaker Management**: Add/remove speakers with custom audio samples
- **Paste Script (JSONL) Import**: Paste a dialog script as JSONL directly into the editor via a modal
- **Memory Optimization**: Automatic model cleanup after generation
- **Output Organization**: Files saved in `single_output/` and `dialog_output/` directories
- **Output Organization**: Files saved in organized directories with ZIP packaging
## Getting Started
1. Clone the repository:
```bash
git clone https://github.com/your-username/chatterbox-test.git
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Prepare speaker samples:
- Create a `speaker_samples/` directory
- Add audio samples (WAV format) for each speaker
- Update `speakers.yaml` with speaker names and file paths
4. Run the app:
```bash
python gradio_app.py
```
### Quick Setup
1. Clone the repository and install dependencies:
```bash
git clone https://github.com/your-username/chatterbox-ui.git
cd chatterbox-ui
pip install -r requirements.txt
npm install
```
2. Run automated setup:
```bash
python setup.py
```
3. Prepare speaker samples:
- Add audio samples (WAV format) to `speaker_data/speaker_samples/`
- Configure speakers in `speaker_data/speakers.yaml`
### Windows Quick Start
On Windows, a PowerShell setup script is provided to automate environment setup and startup.
```powershell
# From the repository root in PowerShell
./setup-windows.ps1
# First time only, if scripts are blocked:
# Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
What it does:
- Creates/uses `.venv`
- Upgrades pip and installs deps from `backend/requirements.txt` and root `requirements.txt`
- Creates a default `.env` with sensible ports if missing
- Starts both servers via `start_servers.py`
### Running the Application
**Full-Stack Web Application:**
```bash
# Start both backend and frontend servers
python start_servers.py
```
On Windows, you can also use the one-liner PowerShell script:
```powershell
./setup-windows.ps1
```
**Individual Components:**
```bash
# Backend only (FastAPI)
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
# Frontend only
cd frontend && python start_dev_server.py
# Gradio interface
python gradio_app.py
```
## Usage
### Single Utterance Tab
- Select a speaker from the dropdown
- Enter text to synthesize
- Adjust generation parameters as needed
- Click "Generate Speech"
### Web Interface
Access the modern web UI at `http://localhost:8001` for interactive dialog creation.
### Dialog Generation Tab
1. Add speakers using the speaker configuration section
2. Enter dialog in the format:
```
Speaker1: "Hello, how are you?"
Speaker2: "I'm doing well!"
Silence: 0.5
Speaker1: "What are your plans for today?"
```
3. Set output base name
4. Click "Generate Dialog"
## File Organization
- Generated single utterances are saved to `single_output/`
- Dialog generation files are saved to `dialog_output/`
- Concatenated dialog files have `_concatenated.wav` suffix
- All files are zipped together for download
#### Paste Script (JSONL) in Dialog Editor
Quickly load a dialog by pasting JSONL (one JSON object per line):
1. Click `Paste Script` in the Dialog Editor.
2. Paste JSONL content, for example:
```jsonl
{"type":"speech","speaker_id":"dummy_speaker","text":"Hello there!"}
{"type":"silence","duration":0.5}
{"type":"speech","speaker_id":"dummy_speaker","text":"This is the second line."}
```
3. Click `Load` and confirm replacement if prompted.
Notes:
- Input is validated per line; errors report line numbers.
- The dialog is saved to localStorage, so it persists across refreshes.
- Unknown `speaker_id`s will still load; add speakers later if needed.
### CLI Tools
**Single utterance generation:**
```bash
python cbx-generate.py --sample speaker_samples/voice.wav --output output.wav --text "Hello world"
```
**Dialog generation:**
```bash
python cbx-dialog-generate.py --dialog dialog.md --output dialog_output
```
**Audiobook generation:**
```bash
python cbx-audiobook.py --input book.txt --output audiobook --speaker speaker_name
```
### Gradio Interface
- **Single Utterance Tab**: Select speaker, enter text, adjust parameters, generate
- **Dialog Generation Tab**: Configure speakers and create multi-speaker conversations
- Dialog format:
```
Speaker1: "Hello, how are you?"
Speaker2: "I'm doing well!"
Silence: 0.5
Speaker1: "What are your plans for today?"
```
## Architecture Overview
### Application Structure
- **Frontend**: Modern vanilla JavaScript web UI (`frontend/`)
- **Backend**: FastAPI REST API (`backend/`)
- **CLI Tools**: Command-line utilities (`cbx-*.py`)
- **Gradio Interface**: Alternative web UI (`gradio_app.py`)
### New Files and Features
- **`cbx-audiobook.py`**: Generate long-form audiobooks from text files
- **`import_helper.py`**: Utility for managing imports and dependencies
- **Backend Services**: Enhanced dialog processing, speaker management, and TTS services
- **Web Frontend**: Interactive dialog editor with drag-and-drop functionality
### File Organization
- `single_output/` - Single utterance generations
- `dialog_output/` - Multi-speaker dialog files
- `tts_outputs/` - Raw TTS generation files
- `speaker_data/` - Speaker configurations and audio samples
- Generated files packaged in ZIP archives for download
### API Endpoints
- `/api/speakers/` - Speaker CRUD operations
- `/api/dialog/generate/` - Full dialog generation
- `/api/dialog/generate_line/` - Single line generation
- `/generated_audio/` - Static audio file serving
## Configuration
### Environment Setup
Key configuration files:
- `.env` - Global settings
- `backend/.env` - Backend-specific settings
- `frontend/.env` - Frontend-specific settings
- `speaker_data/speakers.yaml` - Speaker configuration
### Development Commands
```bash
# Run tests
python backend/run_api_test.py
npm test
# Backend development
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
# Access points
# Web UI: http://localhost:8001
# API: http://localhost:8000
# API Docs: http://localhost:8000/docs
```
## Memory Management
The app automatically:
The application automatically:
- Cleans up the TTS model after each generation
- Frees GPU memory (for CUDA/MPS devices)
- Deletes intermediate tensors to minimize memory footprint
- Manages GPU memory (CUDA/MPS devices)
- Optimizes memory usage for long-form content
## Troubleshooting
- **"Skipping unknown speaker"**: Add the speaker first using the speaker configuration
- **"Sample file not found"**: Verify the audio file exists in `speaker_samples/`
- **Memory issues**: Try enabling "Re-initialize model each line" for long dialogs
- **"Skipping unknown speaker"**: Configure speaker in `speaker_data/speakers.yaml`
- **"Sample file not found"**: Verify audio files exist in `speaker_data/speaker_samples/`
- **Memory issues**: Use model reinitialization options for long content
- **CORS errors**: Check frontend/backend port configuration (frontend origin is auto-included when using `start_servers.py`)
- **Import errors**: Run `python import_helper.py` to check dependencies
### Windows-specific
- If PowerShell blocks script execution, run once:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
- If Windows Firewall prompts the first time you run servers, allow access on your private network.

babel.config.cjs (new file)
@ -0,0 +1,13 @@
// babel.config.cjs
module.exports = {
presets: [
[
'@babel/preset-env',
{
targets: {
node: 'current', // Target the current version of Node.js
},
},
],
],
};

backend/.env.example (new file)
@ -0,0 +1,19 @@
# Backend Configuration
# Copy this file to .env and adjust values as needed
# Project paths
PROJECT_ROOT=/Users/stwhite/CODE/chatterbox-ui
SPEAKER_SAMPLES_DIR=${PROJECT_ROOT}/speaker_data/speaker_samples
TTS_TEMP_OUTPUT_DIR=${PROJECT_ROOT}/tts_temp_outputs
DIALOG_GENERATED_DIR=${PROJECT_ROOT}/backend/tts_generated_dialogs
# Server configuration
HOST=0.0.0.0
PORT=8000
RELOAD=true
# CORS configuration
CORS_ORIGINS=http://localhost:8001,http://127.0.0.1:8001,http://localhost:3000,http://127.0.0.1:3000
# Device configuration (auto, cpu, cuda, mps)
DEVICE=auto

backend/README.md (new file)
@ -0,0 +1,59 @@
# Chatterbox TTS Backend
This directory contains the FastAPI backend for the Chatterbox TTS application.
## Project Structure
- `app/`: Contains the main FastAPI application code.
- `__init__.py`: Makes `app` a Python package.
- `main.py`: FastAPI application instance and core API endpoints.
- `services/`: Business logic for TTS, dialog processing, etc.
- `models/`: Pydantic models for API request/response.
- `utils/`: Utility functions.
- `requirements.txt`: Project dependencies for the backend.
- `README.md`: This file.
## Setup & Running
### Prerequisites
- Python 3.8 or higher
- A Python virtual environment (recommended)
### Installation
1. **Navigate to the backend directory**:
```bash
cd /path/to/chatterbox-ui/backend
```
2. **Set up a virtual environment** (if not already created):
```bash
python -m venv .venv
source .venv/bin/activate # On macOS/Linux
# .\.venv\Scripts\activate # On Windows
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
### Running the Development Server
From the `backend` directory, run:
```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
### Accessing the API
Once running, you can access:
- API documentation (Swagger UI): `http://127.0.0.1:8000/docs`
- Alternative API docs (ReDoc): `http://127.0.0.1:8000/redoc`
- API root: `http://127.0.0.1:8000/`
### Development Notes
- The `--reload` flag enables auto-reload on code changes
- The server will be accessible on all network interfaces with `--host 0.0.0.0`
- Default port is 8000, but you can change it with `--port <port_number>`

backend/app/__init__.py (new file)
@ -0,0 +1 @@

backend/app/config.py (new file)
@ -0,0 +1,85 @@
import os
from pathlib import Path
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
# Project root - can be overridden by environment variable
PROJECT_ROOT = Path(
os.getenv("PROJECT_ROOT", Path(__file__).parent.parent.parent)
).resolve()
# Directory paths
SPEAKER_DATA_BASE_DIR = Path(
os.getenv("SPEAKER_DATA_BASE_DIR", str(PROJECT_ROOT / "speaker_data"))
)
SPEAKER_SAMPLES_DIR = Path(
os.getenv("SPEAKER_SAMPLES_DIR", str(SPEAKER_DATA_BASE_DIR / "speaker_samples"))
)
SPEAKERS_YAML_FILE = Path(
os.getenv("SPEAKERS_YAML_FILE", str(SPEAKER_DATA_BASE_DIR / "speakers.yaml"))
)
# TTS temporary output path (used by DialogProcessorService)
TTS_TEMP_OUTPUT_DIR = Path(
os.getenv("TTS_TEMP_OUTPUT_DIR", str(PROJECT_ROOT / "tts_temp_outputs"))
)
# Final dialog output path (used by Dialog router and served by main app)
# These are stored within the 'backend' directory to be easily servable.
DIALOG_OUTPUT_PARENT_DIR = PROJECT_ROOT / "backend"
DIALOG_GENERATED_DIR = Path(
os.getenv(
"DIALOG_GENERATED_DIR", str(DIALOG_OUTPUT_PARENT_DIR / "tts_generated_dialogs")
)
)
# Alias for clarity and backward compatibility
DIALOG_OUTPUT_DIR = DIALOG_GENERATED_DIR
# Server configuration
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8000"))
RELOAD = os.getenv("RELOAD", "true").lower() == "true"
# CORS configuration: determine allowed origins based on env & frontend binding
_cors_env = os.getenv("CORS_ORIGINS", "")
_frontend_host = os.getenv("FRONTEND_HOST")
_frontend_port = os.getenv("FRONTEND_PORT")
# If the dev server is bound to 0.0.0.0 (all interfaces), allow all origins
if _frontend_host == "0.0.0.0": # dev convenience when binding wildcard
CORS_ORIGINS = ["*"]
elif _cors_env:
# parse comma-separated origins, strip whitespace
CORS_ORIGINS = [origin.strip() for origin in _cors_env.split(",") if origin.strip()]
else:
# default to allow all origins in development
CORS_ORIGINS = ["*"]
# Auto-include specific frontend origin when not using wildcard CORS
if CORS_ORIGINS != ["*"] and _frontend_host and _frontend_port:
_frontend_origin = f"http://{_frontend_host.strip()}:{_frontend_port.strip()}"
if _frontend_origin not in CORS_ORIGINS:
CORS_ORIGINS.append(_frontend_origin)
# Device configuration
DEVICE = os.getenv("DEVICE", "auto")
# Concurrency configuration
# Max number of concurrent TTS generation tasks per dialog request
TTS_MAX_CONCURRENCY = int(os.getenv("TTS_MAX_CONCURRENCY", "3"))
# Model idle eviction configuration
# Enable/disable idle-based model eviction
MODEL_EVICTION_ENABLED = os.getenv("MODEL_EVICTION_ENABLED", "true").lower() == "true"
# Unload model after this many seconds of inactivity (0 disables eviction)
MODEL_IDLE_TIMEOUT_SECONDS = int(os.getenv("MODEL_IDLE_TIMEOUT_SECONDS", "900"))
# How often the reaper checks for idleness
MODEL_IDLE_CHECK_INTERVAL_SECONDS = int(os.getenv("MODEL_IDLE_CHECK_INTERVAL_SECONDS", "60"))
# Ensure directories exist
SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
TTS_TEMP_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True)

backend/app/main.py (new file)
@ -0,0 +1,88 @@
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from pathlib import Path
import asyncio
import contextlib
import logging
import time
from app.routers import speakers, dialog # Import the routers
from app import config
app = FastAPI(
title="Chatterbox TTS API",
description="API for generating TTS dialogs using Chatterbox TTS.",
version="0.1.0",
)
# CORS Middleware configuration
# For development, we'll allow all origins
# In production, you should restrict this to specific origins
app.add_middleware(
CORSMiddleware,
allow_origins=config.CORS_ORIGINS,
allow_credentials=False,
allow_methods=["*"],
allow_headers=["*"],
expose_headers=["*"]
)
# Include routers
app.include_router(speakers.router, prefix="/api/speakers", tags=["Speakers"])
app.include_router(dialog.router, prefix="/api/dialog", tags=["Dialog Generation"])
@app.get("/")
async def read_root():
return {"message": "Welcome to the Chatterbox TTS API!"}
# Ensure the directory for serving generated audio exists
config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True)
# Mount StaticFiles to serve generated dialogs
app.mount("/generated_audio", StaticFiles(directory=config.DIALOG_GENERATED_DIR), name="generated_audio")
# Further endpoints for speakers, dialog generation, etc., will be added here.
# --- Background task: idle model reaper ---
logger = logging.getLogger("app.model_reaper")
@app.on_event("startup")
async def _start_model_reaper():
from app.services.model_manager import ModelManager
async def reaper():
while True:
try:
await asyncio.sleep(config.MODEL_IDLE_CHECK_INTERVAL_SECONDS)
if not getattr(config, "MODEL_EVICTION_ENABLED", True):
continue
timeout = getattr(config, "MODEL_IDLE_TIMEOUT_SECONDS", 0)
if timeout <= 0:
continue
m = ModelManager.instance()
if m.is_loaded() and m.active() == 0 and (time.time() - m.last_used()) >= timeout:
logger.info("Idle timeout reached (%.0fs). Unloading model...", timeout)
await m.unload()
except asyncio.CancelledError:
break
except Exception:
logger.exception("Model reaper encountered an error")
# Log eviction configuration at startup
logger.info(
"Model Eviction -> enabled: %s | idle_timeout: %ss | check_interval: %ss",
getattr(config, "MODEL_EVICTION_ENABLED", True),
getattr(config, "MODEL_IDLE_TIMEOUT_SECONDS", 0),
getattr(config, "MODEL_IDLE_CHECK_INTERVAL_SECONDS", 60),
)
app.state._model_reaper_task = asyncio.create_task(reaper())
@app.on_event("shutdown")
async def _stop_model_reaper():
task = getattr(app.state, "_model_reaper_task", None)
if task:
task.cancel()
with contextlib.suppress(Exception):
await task

backend/app/models/__init__.py (new file)
@ -0,0 +1 @@

backend/app/models/dialog_models.py (new file)
@ -0,0 +1,47 @@
from pydantic import BaseModel, Field, validator
from typing import List, Union, Literal, Optional
class DialogItemBase(BaseModel):
type: str
class SpeechItem(DialogItemBase):
type: Literal['speech'] = 'speech'
speaker_id: str = Field(..., description="ID of the speaker for this speech segment.")
text: str = Field(..., description="Text content to be synthesized.")
exaggeration: Optional[float] = Field(0.5, description="Controls the expressiveness of the speech. Higher values lead to more exaggerated speech. Default from Gradio.")
cfg_weight: Optional[float] = Field(0.5, description="Classifier-Free Guidance weight. Higher values make the speech more aligned with the prompt text and speaker characteristics. Default from Gradio.")
temperature: Optional[float] = Field(0.8, description="Controls randomness in generation. Lower values make speech more deterministic, higher values more varied. Default from Gradio.")
use_existing_audio: Optional[bool] = Field(False, description="If true and audio_url is provided, use the existing audio file instead of generating new audio for this line.")
audio_url: Optional[str] = Field(None, description="Path or URL to pre-generated audio for this line (used if use_existing_audio is true).")
class SilenceItem(DialogItemBase):
type: Literal['silence'] = 'silence'
duration: float = Field(..., gt=0, description="Duration of the silence in seconds.")
use_existing_audio: Optional[bool] = Field(False, description="If true and audio_url is provided, use the existing audio file for silence instead of generating a new silent segment.")
audio_url: Optional[str] = Field(None, description="Path or URL to pre-generated audio for this silence (used if use_existing_audio is true).")
class DialogRequest(BaseModel):
dialog_items: List[Union[SpeechItem, SilenceItem]] = Field(..., description="A list of speech and silence items.")
output_base_name: str = Field(..., description="Base name for the output files (e.g., 'my_dialog_v1'). Extensions will be added automatically.")
@validator('dialog_items', pre=True, each_item=True)
def check_item_type(cls, item):
if not isinstance(item, dict):
raise ValueError("Each dialog item must be a dictionary.")
item_type = item.get('type')
if item_type == 'speech':
# Pydantic will handle further validation based on SpeechItem model
return item
elif item_type == 'silence':
# Pydantic will handle further validation based on SilenceItem model
return item
raise ValueError(f"Unknown dialog item type: {item_type}. Must be 'speech' or 'silence'.")
class DialogResponse(BaseModel):
log: str = Field(description="Log of the dialog generation process.")
# For now, these URLs might be relative paths or placeholders.
# Actual serving strategy will determine the final URL format.
concatenated_audio_url: Optional[str] = Field(None, description="URL/path to the concatenated audio file.")
zip_archive_url: Optional[str] = Field(None, description="URL/path to the ZIP archive of all audio files.")
temp_dir_path: Optional[str] = Field(None, description="Path to the temporary directory holding generated files, for server-side reference.")
error_message: Optional[str] = Field(None, description="Error message if the process failed globally.")

backend/app/models/speaker_models.py (new file)
@ -0,0 +1,20 @@
from pydantic import BaseModel
from typing import Optional
class SpeakerBase(BaseModel):
name: str
class SpeakerCreate(SpeakerBase):
# For receiving speaker name, file will be handled separately by FastAPI's UploadFile
pass
class Speaker(SpeakerBase):
id: str
sample_path: Optional[str] = None # Path to the speaker's audio sample
class Config:
from_attributes = True # Replaces orm_mode = True in Pydantic v2
class SpeakerResponse(SpeakerBase):
id: str
message: Optional[str] = None

backend/app/routers/__init__.py (new file)
@ -0,0 +1 @@

backend/app/routers/dialog.py (new file)
@ -0,0 +1,276 @@
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
from pathlib import Path
import shutil
import os
from app.models.dialog_models import DialogRequest, DialogResponse
from app.services.tts_service import TTSService
from app.services.speaker_service import SpeakerManagementService
from app.services.dialog_processor_service import DialogProcessorService
from app.services.audio_manipulation_service import AudioManipulationService
from app import config
from typing import AsyncIterator
from app.services.model_manager import ModelManager
router = APIRouter()
# --- Dependency Injection for Services ---
# These can be more sophisticated with a proper DI container or FastAPI's Depends system if services had complex init.
# For now, direct instantiation or simple Depends is fine.
async def get_tts_service() -> AsyncIterator[TTSService]:
"""Dependency that holds a usage token for the duration of the request."""
manager = ModelManager.instance()
async with manager.using():
service = await manager.get_service()
yield service
def get_speaker_management_service():
return SpeakerManagementService()
def get_dialog_processor_service(
tts_service: TTSService = Depends(get_tts_service),
speaker_service: SpeakerManagementService = Depends(get_speaker_management_service)
):
return DialogProcessorService(tts_service=tts_service, speaker_service=speaker_service)
def get_audio_manipulation_service():
return AudioManipulationService()
# --- Helper imports ---
from app.models.dialog_models import SpeechItem, SilenceItem
from app.services.tts_service import TTSService
from app.services.audio_manipulation_service import AudioManipulationService
from app.services.speaker_service import SpeakerManagementService
from fastapi import Body
import uuid
from pathlib import Path
@router.post("/generate_line")
async def generate_line(
item: dict = Body(...),
tts_service: TTSService = Depends(get_tts_service),
audio_manipulator: AudioManipulationService = Depends(get_audio_manipulation_service),
speaker_service: SpeakerManagementService = Depends(get_speaker_management_service)
):
"""
Generate audio for a single dialog line (speech or silence).
Returns the URL of the generated audio file, or error details on failure.
"""
try:
if item.get("type") == "speech":
speech = SpeechItem(**item)
filename_base = f"line_{uuid.uuid4().hex}"
out_dir = Path(config.DIALOG_GENERATED_DIR)
# Get speaker sample path
speaker_info = speaker_service.get_speaker_by_id(speech.speaker_id)
if not speaker_info or not getattr(speaker_info, 'sample_path', None):
raise HTTPException(status_code=404, detail=f"Speaker sample for ID '{speech.speaker_id}' not found.")
speaker_sample_path = speaker_info.sample_path
# Ensure absolute path
if not os.path.isabs(speaker_sample_path):
speaker_sample_path = str((Path(config.SPEAKER_SAMPLES_DIR) / Path(speaker_sample_path).name).resolve())
# Generate speech (async)
out_path = await tts_service.generate_speech(
text=speech.text,
speaker_sample_path=speaker_sample_path,
output_filename_base=filename_base,
speaker_id=speech.speaker_id,
output_dir=out_dir,
exaggeration=speech.exaggeration,
cfg_weight=speech.cfg_weight,
temperature=speech.temperature
)
audio_url = f"/generated_audio/{out_path.name}"
return {"audio_url": audio_url}
elif item.get("type") == "silence":
silence = SilenceItem(**item)
filename = f"silence_{uuid.uuid4().hex}.wav"
out_dir = Path(config.DIALOG_GENERATED_DIR)
out_dir.mkdir(parents=True, exist_ok=True) # Ensure output directory exists
out_path = out_dir / filename
try:
# Generate silence
silence_tensor = audio_manipulator.generate_silence(silence.duration)
import torchaudio
torchaudio.save(str(out_path), silence_tensor, audio_manipulator.sample_rate)
if not out_path.exists() or out_path.stat().st_size == 0:
raise HTTPException(
status_code=500,
detail=f"Failed to generate silence. Output file not created: {out_path}"
)
audio_url = f"/generated_audio/{filename}"
return {"audio_url": audio_url}
except Exception as e:
if isinstance(e, HTTPException):
raise e
raise HTTPException(
status_code=500,
detail=f"Error generating silence: {str(e)}"
)
else:
raise HTTPException(
status_code=400,
detail=f"Unknown dialog item type: {item.get('type')}. Expected 'speech' or 'silence'."
)
except HTTPException as he:
# Re-raise HTTP exceptions as-is
raise he
except Exception as e:
import traceback
tb = traceback.format_exc()
error_detail = f"Unexpected error: {str(e)}\n\nTraceback:\n{tb}"
print(error_detail) # Log to console for debugging
raise HTTPException(
status_code=500,
detail=error_detail
)
# Removed per-request load/unload in favor of ModelManager idle eviction.
async def process_dialog_flow(
request: DialogRequest,
dialog_processor: DialogProcessorService,
audio_manipulator: AudioManipulationService,
background_tasks: BackgroundTasks
) -> DialogResponse:
"""Core logic for processing the dialog request."""
processing_log_entries = []
concatenated_audio_file_path = None
zip_archive_file_path = None
final_temp_dir_path_str = None
try:
# 1. Process dialog to generate segments
# The DialogProcessorService creates its own temp dir for segments
dialog_processing_result = await dialog_processor.process_dialog(
dialog_items=[item.model_dump() for item in request.dialog_items],
output_base_name=request.output_base_name
)
processing_log_entries.append(dialog_processing_result['log'])
segment_details = dialog_processing_result['segment_files']
temp_segment_dir = Path(dialog_processing_result['temp_dir'])
final_temp_dir_path_str = str(temp_segment_dir)
# Filter out error segments for concatenation and zipping
valid_segment_paths_for_concat = [
Path(s['path']) for s in segment_details
if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists()
]
# Create a list of dicts suitable for concatenation service (speech paths and silence durations)
items_for_concatenation = []
for s_detail in segment_details:
if s_detail['type'] == 'speech' and s_detail.get('path') and Path(s_detail['path']).exists():
items_for_concatenation.append({'type': 'speech', 'path': s_detail['path']})
elif s_detail['type'] == 'silence' and 'duration' in s_detail:
items_for_concatenation.append({'type': 'silence', 'duration': s_detail['duration']})
# Errors are already logged by DialogProcessor
if not any(item['type'] == 'speech' for item in items_for_concatenation):
message = "No valid speech segments were generated. Cannot create concatenated audio or ZIP."
processing_log_entries.append(message)
return DialogResponse(
log="\n".join(processing_log_entries),
temp_dir_path=final_temp_dir_path_str,
error_message=message
)
# 2. Concatenate audio segments
config.DIALOG_GENERATED_DIR.mkdir(parents=True, exist_ok=True)
concat_filename = f"{request.output_base_name}_concatenated.wav"
concatenated_audio_file_path = config.DIALOG_GENERATED_DIR / concat_filename
audio_manipulator.concatenate_audio_segments(
segment_results=items_for_concatenation,
output_concatenated_path=concatenated_audio_file_path
)
processing_log_entries.append(f"Concatenated audio saved to: {concatenated_audio_file_path}")
# 3. Create ZIP archive
zip_filename = f"{request.output_base_name}_dialog_output.zip"
zip_archive_path = config.DIALOG_GENERATED_DIR / zip_filename
# Collect all valid generated speech segment files for zipping
individual_segment_paths = [
Path(s['path']) for s in segment_details
if s['type'] == 'speech' and s.get('path') and Path(s['path']).exists()
]
# concatenated_audio_file_path is already defined and checked for existence before this block
audio_manipulator.create_zip_archive(
segment_file_paths=individual_segment_paths,
concatenated_audio_path=concatenated_audio_file_path,
output_zip_path=zip_archive_path
)
processing_log_entries.append(f"ZIP archive created at: {zip_archive_path}")
# Schedule cleanup of the temporary segment directory
# background_tasks.add_task(shutil.rmtree, temp_segment_dir, ignore_errors=True)
# processing_log_entries.append(f"Scheduled cleanup for temporary segment directory: {temp_segment_dir}")
# For now, let's not auto-delete, so user can inspect. Cleanup can be a separate endpoint/job.
processing_log_entries.append(f"Temporary segment directory for inspection: {temp_segment_dir}")
return DialogResponse(
log="\n".join(processing_log_entries),
# URLs should be relative to a static serving path, e.g., /generated_audio/
# For now, just returning the name, assuming they are in DIALOG_OUTPUT_DIR
concatenated_audio_url=f"/generated_audio/{concat_filename}",
zip_archive_url=f"/generated_audio/{zip_filename}",
temp_dir_path=final_temp_dir_path_str
)
except FileNotFoundError as e:
error_msg = f"File not found during dialog generation: {e}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=404, detail=error_msg)
except ValueError as e:
error_msg = f"Invalid value or configuration: {e}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=400, detail=error_msg)
except RuntimeError as e:
error_msg = f"Runtime error during dialog generation: {e}"
processing_log_entries.append(error_msg)
# This could be a 500 if it's an unexpected server error
raise HTTPException(status_code=500, detail=error_msg)
except Exception as e:
import traceback
error_msg = f"An unexpected error occurred: {e}\n{traceback.format_exc()}"
processing_log_entries.append(error_msg)
raise HTTPException(status_code=500, detail=error_msg)
finally:
# Ensure logs are captured even if an early exception occurs before full response construction
if not concatenated_audio_file_path and not zip_archive_file_path and processing_log_entries:
print("Dialog generation failed. Log: \n" + "\n".join(processing_log_entries))
@router.post("/generate", response_model=DialogResponse)
async def generate_dialog_endpoint(
request: DialogRequest,
background_tasks: BackgroundTasks,
tts_service: TTSService = Depends(get_tts_service),
dialog_processor: DialogProcessorService = Depends(get_dialog_processor_service),
audio_manipulator: AudioManipulationService = Depends(get_audio_manipulation_service)
):
"""
Generates a dialog from a list of speech and silence items.
- Processes text into manageable chunks.
- Generates speech for each chunk using the specified speaker.
- Inserts silences as requested.
- Concatenates all audio segments into a single file.
- Creates a ZIP archive of all individual segments and the concatenated file.
"""
# Execute core processing; ModelManager dependency keeps the model marked "in use".
return await process_dialog_flow(
request=request,
dialog_processor=dialog_processor,
audio_manipulator=audio_manipulator,
background_tasks=background_tasks,
)

backend/app/routers/speakers.py (new file)
@ -0,0 +1,81 @@
from typing import List, Annotated
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Form
from app.models.speaker_models import Speaker, SpeakerResponse
from app.services.speaker_service import SpeakerManagementService
router = APIRouter(
tags=["Speakers"],
responses={404: {"description": "Not found"}},
)
# Dependency to get the speaker service instance
# This could be more sophisticated with a proper DI system later
def get_speaker_service():
return SpeakerManagementService()
@router.get("/", response_model=List[Speaker])
async def get_all_speakers(
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Retrieve all available speakers.
"""
return service.get_speakers()
@router.post("/", response_model=SpeakerResponse, status_code=201)
async def create_new_speaker(
name: Annotated[str, Form()],
audio_file: Annotated[UploadFile, File()],
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Add a new speaker.
Requires speaker name (form data) and an audio sample file (file upload).
"""
if not audio_file.filename:
raise HTTPException(status_code=400, detail="No audio file provided.")
if not audio_file.content_type or not audio_file.content_type.startswith("audio/"):
raise HTTPException(status_code=400, detail="Invalid audio file type. Please upload a valid audio file (e.g., WAV, MP3).")
try:
new_speaker = await service.add_speaker(name=name, audio_file=audio_file)
return SpeakerResponse(
id=new_speaker.id,
name=new_speaker.name,
message="Speaker added successfully."
)
except HTTPException as e:
# Re-raise HTTPExceptions from the service (e.g., file save error)
raise e
except Exception as e:
# Catch-all for other unexpected errors
raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {str(e)}")
@router.get("/{speaker_id}", response_model=Speaker)
async def get_speaker_details(
speaker_id: str,
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Get details for a specific speaker by ID.
"""
speaker = service.get_speaker_by_id(speaker_id)
if not speaker:
raise HTTPException(status_code=404, detail="Speaker not found")
return speaker
@router.delete("/{speaker_id}", response_model=dict)
async def remove_speaker(
speaker_id: str,
service: Annotated[SpeakerManagementService, Depends(get_speaker_service)]
):
"""
Delete a speaker by ID.
"""
deleted = service.delete_speaker(speaker_id)
if not deleted:
raise HTTPException(status_code=404, detail="Speaker not found or could not be deleted.")
return {"message": "Speaker deleted successfully"}

backend/app/services/__init__.py (new file)
@ -0,0 +1 @@

backend/app/services/audio_manipulation_service.py (new file)
@ -0,0 +1,241 @@
import torch
import torchaudio
from pathlib import Path
from typing import List, Dict, Union, Tuple
import zipfile
# Define a common sample rate, e.g., from the TTS model. This should ideally be configurable or dynamically obtained.
# For now, let's assume the TTS model (ChatterboxTTS) outputs at a known sample rate.
# The ChatterboxTTS model.sr is 24000.
DEFAULT_SAMPLE_RATE = 24000
class AudioManipulationService:
def __init__(self, default_sample_rate: int = DEFAULT_SAMPLE_RATE):
self.sample_rate = default_sample_rate
def _load_audio(self, file_path: Union[str, Path]) -> Tuple[torch.Tensor, int]:
"""Loads an audio file and returns the waveform and sample rate."""
try:
waveform, sr = torchaudio.load(file_path)
return waveform, sr
except Exception as e:
raise RuntimeError(f"Error loading audio file {file_path}: {e}")
def _create_silence(self, duration_seconds: float) -> torch.Tensor:
"""Creates a silent audio tensor of a given duration."""
num_frames = int(duration_seconds * self.sample_rate)
return torch.zeros((1, num_frames)) # Mono silence
def concatenate_audio_segments(
self,
segment_results: List[Dict],
output_concatenated_path: Path
) -> Path:
"""
Concatenates audio segments and silences into a single audio file.
Args:
segment_results: A list of dictionaries, where each dict represents an audio
segment or a silence. Expected format:
For speech: {'type': 'speech', 'path': 'path/to/audio.wav', ...}
For silence: {'type': 'silence', 'duration': 0.5, ...}
output_concatenated_path: The path to save the final concatenated audio file.
Returns:
The path to the concatenated audio file.
"""
all_waveforms: List[torch.Tensor] = []
current_sample_rate = self.sample_rate # Assume this initially, verify with first loaded audio
for i, segment_info in enumerate(segment_results):
segment_type = segment_info.get("type")
if segment_type == "speech":
audio_path_str = segment_info.get("path")
if not audio_path_str:
print(f"Warning: Speech segment {i} has no path. Skipping.")
continue
audio_path = Path(audio_path_str)
if not audio_path.exists():
print(f"Warning: Audio file {audio_path} for segment {i} not found. Skipping.")
continue
try:
waveform, sr = self._load_audio(audio_path)
# Ensure consistent sample rate. Resample if necessary.
# For simplicity, this example assumes all inputs will match self.sample_rate
# or the first loaded audio's sample rate. A more robust implementation
# would resample if sr != current_sample_rate.
if i == 0 and not all_waveforms: # First audio segment sets the reference SR if not default
current_sample_rate = sr
if sr != self.sample_rate:
print(f"Warning: First audio segment SR ({sr} Hz) differs from service default SR ({self.sample_rate} Hz). Using segment SR.")
if sr != current_sample_rate:
print(f"Warning: Sample rate mismatch for {audio_path} ({sr} Hz) vs expected ({current_sample_rate} Hz). Resampling...")
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=current_sample_rate)
waveform = resampler(waveform)
# Ensure mono. If stereo, take the mean or first channel.
if waveform.shape[0] > 1:
waveform = torch.mean(waveform, dim=0, keepdim=True)
all_waveforms.append(waveform)
except Exception as e:
print(f"Error processing speech segment {audio_path}: {e}. Skipping.")
elif segment_type == "silence":
duration = segment_info.get("duration")
if duration is None or not isinstance(duration, (int, float)) or duration < 0:
print(f"Warning: Silence segment {i} has invalid duration. Skipping.")
continue
silence_waveform = self._create_silence(float(duration))
all_waveforms.append(silence_waveform)
elif segment_type == "error":
# Errors are already logged by DialogProcessorService, just skip here.
print(f"Skipping segment {i} due to previous error: {segment_info.get('message')}")
continue
else:
print(f"Warning: Unknown segment type '{segment_type}' at index {i}. Skipping.")
if not all_waveforms:
raise ValueError("No valid audio segments or silences found to concatenate.")
# Concatenate all waveforms
final_waveform = torch.cat(all_waveforms, dim=1)
# Ensure output directory exists
output_concatenated_path.parent.mkdir(parents=True, exist_ok=True)
# Save the concatenated audio
try:
torchaudio.save(str(output_concatenated_path), final_waveform, current_sample_rate)
print(f"Concatenated audio saved to: {output_concatenated_path}")
return output_concatenated_path
except Exception as e:
raise RuntimeError(f"Error saving concatenated audio to {output_concatenated_path}: {e}")
def create_zip_archive(
self,
segment_file_paths: List[Path],
concatenated_audio_path: Path,
output_zip_path: Path
) -> Path:
"""
Creates a ZIP archive containing individual audio segments and the concatenated audio file.
Args:
segment_file_paths: A list of paths to the individual audio segment files.
concatenated_audio_path: Path to the final concatenated audio file.
output_zip_path: The path to save the output ZIP archive.
Returns:
The path to the created ZIP archive.
"""
output_zip_path.parent.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
# Add concatenated audio
if concatenated_audio_path.exists():
zf.write(concatenated_audio_path, arcname=concatenated_audio_path.name)
else:
print(f"Warning: Concatenated audio file {concatenated_audio_path} not found for zipping.")
# Add individual segments
segments_dir_name = "segments"
for file_path in segment_file_paths:
if file_path.exists() and file_path.is_file():
# Store segments in a subdirectory within the zip for organization
zf.write(file_path, arcname=Path(segments_dir_name) / file_path.name)
else:
print(f"Warning: Segment file {file_path} not found or is not a file. Skipping for zipping.")
print(f"ZIP archive created at: {output_zip_path}")
return output_zip_path
# Example Usage (Test Block)
if __name__ == "__main__":
import tempfile
import shutil
# Create a temporary directory for test files
test_temp_dir = Path(tempfile.mkdtemp(prefix="audio_manip_test_"))
print(f"Created temporary test directory: {test_temp_dir}")
# Instance of the service
audio_service = AudioManipulationService()
# --- Test Data Setup ---
# Create dummy audio files (e.g., short silences with different names)
dummy_sr = audio_service.sample_rate
segment1_path = test_temp_dir / "segment1_speech.wav"
segment2_path = test_temp_dir / "segment2_speech.wav"
torchaudio.save(str(segment1_path), audio_service._create_silence(1.0), dummy_sr)
# Create a dummy segment with a different sample rate to test resampling
dummy_sr_alt = 16000
temp_waveform_alt_sr = torch.rand((1, int(0.5 * dummy_sr_alt))) # 0.5s at 16kHz
torchaudio.save(str(segment2_path), temp_waveform_alt_sr, dummy_sr_alt)
segment_results_for_concat = [
{"type": "speech", "path": str(segment1_path), "speaker_id": "spk1", "text_chunk": "Test 1"},
{"type": "silence", "duration": 0.5},
{"type": "speech", "path": str(segment2_path), "speaker_id": "spk2", "text_chunk": "Test 2 (alt SR)"},
{"type": "error", "message": "Simulated error, should be skipped"},
{"type": "speech", "path": "non_existent_segment.wav"}, # Test non-existent file
{"type": "silence", "duration": -0.2} # Test invalid duration
]
concatenated_output_path = test_temp_dir / "final_concatenated_audio.wav"
zip_output_path = test_temp_dir / "audio_archive.zip"
all_segment_files_for_zip = [segment1_path, segment2_path]
try:
# Test concatenation
print("\n--- Testing Concatenation ---")
actual_concat_path = audio_service.concatenate_audio_segments(
segment_results_for_concat,
concatenated_output_path
)
print(f"Concatenation test successful. Output: {actual_concat_path}")
assert actual_concat_path.exists()
# Basic check: load concatenated and verify duration (approx)
concat_wav, concat_sr = audio_service._load_audio(actual_concat_path)
expected_duration = 1.0 + 0.5 + 0.5 # seg1 (1.0s) + silence (0.5s) + seg2 (0.5s) = 2.0s
actual_duration = concat_wav.shape[1] / concat_sr
print(f"Expected duration (approx): {expected_duration}s, Actual duration: {actual_duration:.2f}s")
assert abs(actual_duration - expected_duration) < 0.1 # Allow small deviation
# Test Zipping
print("\n--- Testing Zipping ---")
actual_zip_path = audio_service.create_zip_archive(
all_segment_files_for_zip,
actual_concat_path,
zip_output_path
)
print(f"Zipping test successful. Output: {actual_zip_path}")
assert actual_zip_path.exists()
# Verify zip contents (basic check)
segments_dir_name = "segments" # Define this for the assertion below
with zipfile.ZipFile(actual_zip_path, 'r') as zf_read:
zip_contents = zf_read.namelist()
print(f"ZIP contents: {zip_contents}")
assert Path(segments_dir_name) / segment1_path.name in [Path(p) for p in zip_contents]
assert Path(segments_dir_name) / segment2_path.name in [Path(p) for p in zip_contents]
assert concatenated_output_path.name in zip_contents
print("\nAll AudioManipulationService tests passed!")
except Exception as e:
import traceback
print(f"\nAn error occurred during AudioManipulationService tests:")
traceback.print_exc()
finally:
# Clean up temporary directory
# shutil.rmtree(test_temp_dir)
# print(f"Cleaned up temporary test directory: {test_temp_dir}")
print(f"Test files are in {test_temp_dir}. Please inspect and delete manually if needed.")

backend/app/services/dialog_processor_service.py (new file)
@ -0,0 +1,378 @@
from pathlib import Path
from typing import List, Dict, Any, Union
import re
import asyncio
from datetime import datetime
from .tts_service import TTSService
from .speaker_service import SpeakerManagementService
try:
from app import config
except ModuleNotFoundError:
# When imported from scripts at project root
from backend.app import config
# Potentially models for dialog structure if we define them
# from ..models.dialog_models import DialogItem # Example
class DialogProcessorService:
def __init__(self, tts_service: TTSService, speaker_service: SpeakerManagementService):
self.tts_service = tts_service
self.speaker_service = speaker_service
# Base directory for storing individual audio segments during processing
self.temp_audio_dir = config.TTS_TEMP_OUTPUT_DIR
self.temp_audio_dir.mkdir(parents=True, exist_ok=True)
def _split_text(self, text: str, max_length: int = 300) -> List[str]:
"""
Splits text into chunks suitable for TTS processing, attempting to respect sentence boundaries.
Similar to split_text_at_sentence_boundaries from the original Gradio app.
Max_length is approximate, as it tries to finish sentences.
"""
# Basic sentence splitting using common delimiters. More sophisticated NLP could be used.
# This regex tries to split by '.', '!', '?', '...', followed by space or end of string.
# It also handles cases where these delimiters might be followed by quotes or parentheses.
sentences = re.split(r'(?<=[.!?\u2026])\s+|(?<=[.!?\u2026])(?=["\')\]\}\u201d\u2019])|(?<=[.!?\u2026])$', text.strip())
sentences = [s.strip() for s in sentences if s and s.strip()]
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence:
continue
if not current_chunk: # First sentence for this chunk
current_chunk = sentence
elif len(current_chunk) + len(sentence) + 1 <= max_length:
current_chunk += " " + sentence
else:
chunks.append(current_chunk)
current_chunk = sentence
if current_chunk: # Add the last chunk
chunks.append(current_chunk)
# Further split any chunks that are still too long (e.g., a single very long sentence)
final_chunks = []
for chunk in chunks:
if len(chunk) > max_length:
# Simple split by length if a sentence itself is too long
for i in range(0, len(chunk), max_length):
final_chunks.append(chunk[i:i+max_length])
else:
final_chunks.append(chunk)
return final_chunks
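# Illustrative example of the chunking above:
# _split_text("One. Two. Three.", max_length=10) -> ["One. Two.", "Three."]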
async def process_dialog(self, dialog_items: List[Dict[str, Any]], output_base_name: str) -> Dict[str, Any]:
"""
Processes a list of dialog items (speech or silence) to generate audio segments.
Args:
dialog_items: A list of dictionaries, where each item has:
- 'type': 'speech' or 'silence'
- For 'speech': 'speaker_id': str, 'text': str
- For 'silence': 'duration': float (in seconds)
output_base_name: The base name for the output files.
Returns:
A dictionary containing paths to generated segments and other processing info.
Example: {
"log": "Processing complete...",
"segment_files": [
{"type": "speech", "path": "/path/to/segment1.wav", "speaker_id": "X", "text_chunk": "..."},
{"type": "silence", "duration": 0.5},
{"type": "speech", "path": "/path/to/segment2.wav", "speaker_id": "Y", "text_chunk": "..."}
],
"temp_dir": str(self.temp_audio_dir / output_base_name)
}
"""
segment_results = []
processing_log = []
# Create a unique subdirectory for this dialog's temporary files
dialog_temp_dir = self.temp_audio_dir / output_base_name
dialog_temp_dir.mkdir(parents=True, exist_ok=True)
processing_log.append(f"Created temporary directory for segments: {dialog_temp_dir}")
import shutil
segment_idx = 0
tasks = []
results_map: Dict[int, Dict[str, Any]] = {}
sem = asyncio.Semaphore(getattr(config, "TTS_MAX_CONCURRENCY", 2))
async def run_one(planned: Dict[str, Any]):
async with sem:
text_chunk = planned["text_chunk"]
speaker_id = planned["speaker_id"]
abs_speaker_sample_path = planned["abs_speaker_sample_path"]
filename_base = planned["filename_base"]
params = planned["params"]
seg_idx = planned["segment_idx"]
start_ts = datetime.now()
start_line = (
f"[{start_ts.isoformat(timespec='seconds')}] [TTS-TASK] START seg_idx={seg_idx} "
f"speaker={speaker_id} chunk_len={len(text_chunk)} base={filename_base}"
)
try:
out_path = await self.tts_service.generate_speech(
text=text_chunk,
speaker_id=speaker_id,
speaker_sample_path=str(abs_speaker_sample_path),
output_filename_base=filename_base,
output_dir=dialog_temp_dir,
exaggeration=params.get('exaggeration', 0.5),
cfg_weight=params.get('cfg_weight', 0.5),
temperature=params.get('temperature', 0.8),
)
end_ts = datetime.now()
duration = (end_ts - start_ts).total_seconds()
end_line = (
f"[{end_ts.isoformat(timespec='seconds')}] [TTS-TASK] END seg_idx={seg_idx} "
f"dur={duration:.2f}s -> {out_path}"
)
return seg_idx, {
"type": "speech",
"path": str(out_path),
"speaker_id": speaker_id,
"text_chunk": text_chunk,
}, start_line + "\n" + f"Successfully generated segment: {out_path}" + "\n" + end_line
except Exception as e:
end_ts = datetime.now()
err_line = (
f"[{end_ts.isoformat(timespec='seconds')}] [TTS-TASK] ERROR seg_idx={seg_idx} "
f"speaker={speaker_id} err={repr(e)}"
)
return seg_idx, {
"type": "error",
"message": f"Error generating speech for chunk '{text_chunk[:50]}...': {repr(e)}",
"text_chunk": text_chunk,
}, err_line
for i, item in enumerate(dialog_items):
item_type = item.get("type")
processing_log.append(f"Processing item {i+1}: type='{item_type}'")
# --- Handle reuse of existing audio ---
use_existing_audio = item.get("use_existing_audio", False)
audio_url = item.get("audio_url")
if use_existing_audio and audio_url:
if audio_url.startswith("/generated_audio/"):
src_audio_path = config.DIALOG_OUTPUT_DIR / audio_url[len("/generated_audio/"):]
else:
src_audio_path = Path(audio_url)
if not src_audio_path.is_absolute():
src_audio_path = config.DIALOG_OUTPUT_DIR / audio_url.lstrip("/\\")
if src_audio_path.is_file():
segment_filename = f"{output_base_name}_seg{segment_idx}_reused.wav"
dest_path = (self.temp_audio_dir / output_base_name / segment_filename)
try:
if not src_audio_path.exists():
processing_log.append(f"[REUSE] Source audio file does not exist: {src_audio_path}")
else:
processing_log.append(f"[REUSE] Source audio file exists: {src_audio_path}, size={src_audio_path.stat().st_size} bytes")
shutil.copyfile(src_audio_path, dest_path)
if not dest_path.exists():
processing_log.append(f"[REUSE] Destination audio file was not created: {dest_path}")
else:
processing_log.append(f"[REUSE] Destination audio file created: {dest_path}, size={dest_path.stat().st_size} bytes")
results_map[segment_idx] = {"type": item_type, "path": str(dest_path)}
processing_log.append(f"Reused existing audio for item {i+1}: copied from {src_audio_path} to {dest_path}")
except Exception as e:
error_message = f"Failed to copy reused audio for item {i+1}: {e}"
processing_log.append(error_message)
results_map[segment_idx] = {"type": "error", "message": error_message}
segment_idx += 1
continue
else:
error_message = f"Audio file for reuse not found at {src_audio_path} for item {i+1}."
processing_log.append(error_message)
results_map[segment_idx] = {"type": "error", "message": error_message}
segment_idx += 1
continue
if item_type == "speech":
speaker_id = item.get("speaker_id")
text = item.get("text")
if not speaker_id or not text:
processing_log.append(f"Skipping speech item {i+1} due to missing speaker_id or text.")
results_map[segment_idx] = {"type": "error", "message": "Missing speaker_id or text"}
segment_idx += 1
continue
speaker_info = self.speaker_service.get_speaker_by_id(speaker_id)
if not speaker_info:
processing_log.append(f"Speaker ID '{speaker_id}' not found. Skipping item {i+1}.")
results_map[segment_idx] = {"type": "error", "message": f"Speaker ID '{speaker_id}' not found"}
segment_idx += 1
continue
if not speaker_info.sample_path:
processing_log.append(f"Speaker ID '{speaker_id}' has no sample path defined. Skipping item {i+1}.")
results_map[segment_idx] = {"type": "error", "message": f"Speaker ID '{speaker_id}' has no sample path defined"}
segment_idx += 1
continue
abs_speaker_sample_path = config.SPEAKER_DATA_BASE_DIR / speaker_info.sample_path
if not abs_speaker_sample_path.is_file():
processing_log.append(f"Speaker sample file not found or is not a file at '{abs_speaker_sample_path}' for speaker ID '{speaker_id}'. Skipping item {i+1}.")
results_map[segment_idx] = {"type": "error", "message": f"Speaker sample not a file or not found: {abs_speaker_sample_path}"}
segment_idx += 1
continue
text_chunks = self._split_text(text)
processing_log.append(f"Split text for speaker '{speaker_id}' into {len(text_chunks)} chunk(s).")
for chunk_idx, text_chunk in enumerate(text_chunks):
filename_base = f"{output_base_name}_seg{segment_idx}_spk{speaker_id}_chunk{chunk_idx}"
processing_log.append(f"Queueing TTS for chunk: '{text_chunk[:50]}...' using speaker '{speaker_id}'")
planned = {
"segment_idx": segment_idx,
"speaker_id": speaker_id,
"text_chunk": text_chunk,
"abs_speaker_sample_path": abs_speaker_sample_path,
"filename_base": filename_base,
"params": {
'exaggeration': item.get('exaggeration', 0.5),
'cfg_weight': item.get('cfg_weight', 0.5),
'temperature': item.get('temperature', 0.8),
},
}
tasks.append(asyncio.create_task(run_one(planned)))
segment_idx += 1
elif item_type == "silence":
duration = item.get("duration")
if duration is None or duration < 0:
processing_log.append(f"Skipping silence item {i+1} due to invalid duration.")
results_map[segment_idx] = {"type": "error", "message": "Invalid duration for silence"}
segment_idx += 1
continue
results_map[segment_idx] = {"type": "silence", "duration": float(duration)}
processing_log.append(f"Added silence of {duration}s.")
segment_idx += 1
else:
processing_log.append(f"Unknown item type '{item_type}' at item {i+1}. Skipping.")
results_map[segment_idx] = {"type": "error", "message": f"Unknown item type: {item_type}"}
segment_idx += 1
# Await all TTS tasks and merge results
if tasks:
processing_log.append(
f"Dispatching {len(tasks)} TTS task(s) with concurrency limit "
f"{getattr(config, 'TTS_MAX_CONCURRENCY', 2)}"
)
completed = await asyncio.gather(*tasks, return_exceptions=False)
for idx, payload, maybe_log in completed:
results_map[idx] = payload
if maybe_log:
processing_log.append(maybe_log)
# Build ordered list
for idx in sorted(results_map.keys()):
segment_results.append(results_map[idx])
# Log the full segment_results list for debugging
processing_log.append("[DEBUG] Final segment_results list:")
for idx, seg in enumerate(segment_results):
processing_log.append(f" [{idx}] {seg}")
return {
"log": "\n".join(processing_log),
"segment_files": segment_results,
"temp_dir": str(dialog_temp_dir)
}
if __name__ == "__main__":
import pprint
async def main_test():
# Initialize services
tts_service = TTSService(device="mps") # or your preferred device
speaker_service = SpeakerManagementService()
dialog_processor = DialogProcessorService(tts_service, speaker_service)
# Ensure dummy speaker sample exists (TTSService test block usually creates this)
# For robustness, we can call the TTSService test logic or ensure it's run prior.
# Here, we assume dummy_speaker_test.wav is available as per previous steps.
# If not, the 'test_speaker_for_dialog_proc' will fail file validation.
# First, ensure the dummy speaker file is created by TTSService's own test logic
# This is a bit of a hack for testing; ideally, test assets are managed independently.
try:
print("Ensuring dummy speaker sample is created by running TTSService's main_test logic...")
from .tts_service import main_test as tts_main_test
await tts_main_test() # This will create the dummy_speaker_test.wav
print("TTSService main_test completed, dummy sample should exist.")
except ImportError:
print("Could not import tts_service.main_test directly. Ensure dummy_speaker_test.wav exists.")
except Exception as e:
print(f"Error running tts_service.main_test for dummy sample creation: {e}")
print("Proceeding, but 'test_speaker_for_dialog_proc' might fail if sample is missing.")
sample_dialog_items = [
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc", # Defined in speakers.yaml
"text": "Hello world! This is the first speech segment."
},
{
"type": "silence",
"duration": 0.75
},
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc",
"text": "This is a much longer piece of text that should definitely be split into multiple, smaller chunks by the dialog processor. It contains several sentences. Let's see how it handles this. The maximum length is set to 300 characters, but it tries to respect sentence boundaries. This sentence itself is quite long and might even be split mid-sentence if it exceeds the hard limit after sentence splitting. We will observe the output carefully to ensure it works as expected, creating multiple audio files for this single text block if necessary."
},
{
"type": "speech",
"speaker_id": "non_existent_speaker_id",
"text": "This should fail because the speaker does not exist."
},
{
"type": "invalid_type",
"text": "This item has an invalid type."
},
{
"type": "speech",
"speaker_id": "test_speaker_for_dialog_proc",
"text": None # Test missing text
},
{
"type": "speech",
"speaker_id": None, # Test missing speaker_id
"text": "This is a test with a missing speaker ID."
},
{
"type": "silence",
"duration": -0.5 # Invalid duration
}
]
output_base_name = "dialog_processor_test_run"
try:
print(f"\nLoading TTS model for DialogProcessorService test...")
# TTSService's generate_speech will load the model if not already loaded.
# However, explicit load/unload is good practice for a test block.
tts_service.load_model()
print(f"\nProcessing dialog items with base name: {output_base_name}...")
results = await dialog_processor.process_dialog(sample_dialog_items, output_base_name)
print("\n--- Processing Log ---")
print(results.get("log"))
print("\n--- Segment Files / Results ---")
pprint.pprint(results.get("segment_files"))
print(f"\nTemporary directory used: {results.get('temp_dir')}")
print("\nPlease check the temporary directory for generated audio segments.")
except Exception as e:
import traceback
print(f"\nAn error occurred during the DialogProcessorService test:")
traceback.print_exc()
finally:
print("\nUnloading TTS model...")
tts_service.unload_model()
print("DialogProcessorService test finished.")
asyncio.run(main_test())
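The fan-out pattern process_dialog uses above reduces to a small self-contained sketch: plan work in submission order, cap in-flight jobs with an asyncio.Semaphore, and merge results back by index (fake_tts is a stand-in for generate_speech):

import asyncio
from typing import Any, Awaitable, Callable, Dict, List

async def run_bounded(jobs: List[Callable[[], Awaitable[Any]]], limit: int = 2) -> Dict[int, Any]:
    # A semaphore caps in-flight jobs, mirroring TTS_MAX_CONCURRENCY above;
    # results are keyed by submission index so output order is stable.
    sem = asyncio.Semaphore(limit)

    async def run_one(idx: int, job: Callable[[], Awaitable[Any]]):
        async with sem:
            return idx, await job()

    pairs = await asyncio.gather(*(run_one(i, j) for i, j in enumerate(jobs)))
    return dict(pairs)

async def demo() -> None:
    async def fake_tts(n: int) -> str:
        await asyncio.sleep(0.05)  # stand-in for a real TTS call
        return f"segment_{n}.wav"

    results = await run_bounded([lambda n=n: fake_tts(n) for n in range(5)], limit=2)
    print([results[i] for i in sorted(results)])  # ordered despite concurrent execution

asyncio.run(demo())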

170
backend/app/services/model_manager.py Normal file
View File

@ -0,0 +1,170 @@
import asyncio
import time
import logging
from typing import Optional
import gc
import os
_proc = None
try:
import psutil # type: ignore
_proc = psutil.Process(os.getpid())
except Exception:
psutil = None # type: ignore
def _rss_mb() -> float:
"""Return current process RSS in MB, or -1.0 if unavailable."""
global _proc
try:
if _proc is None and psutil is not None:
_proc = psutil.Process(os.getpid())
if _proc is not None:
return _proc.memory_info().rss / (1024 * 1024)
except Exception:
return -1.0
return -1.0
try:
import torch # Optional; used for cache cleanup metrics
except Exception: # pragma: no cover - torch may not be present in some envs
torch = None # type: ignore
from app import config
from app.services.tts_service import TTSService
logger = logging.getLogger(__name__)
class ModelManager:
_instance: Optional["ModelManager"] = None
def __init__(self):
self._service: Optional[TTSService] = None
self._last_used: float = time.time()
self._active: int = 0
self._lock = asyncio.Lock()
self._counter_lock = asyncio.Lock()
@classmethod
def instance(cls) -> "ModelManager":
if not cls._instance:
cls._instance = cls()
return cls._instance
async def _ensure_service(self) -> None:
if self._service is None:
# Use configured device, default is handled by TTSService itself
device = getattr(config, "DEVICE", "auto")
# TTSService expects an explicit device string ("mps"/"cuda"/"cpu"); resolve "auto" to the best available backend
if device == "auto":
try:
import torch
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
device = "mps"
elif torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
except Exception:
device = "cpu"
self._service = TTSService(device=device)
async def load(self) -> None:
async with self._lock:
await self._ensure_service()
if self._service and self._service.model is None:
before_mb = _rss_mb()
logger.info(
"Loading TTS model (device=%s)... (rss_before=%.1f MB)",
self._service.device,
before_mb,
)
self._service.load_model()
after_mb = _rss_mb()
if after_mb >= 0 and before_mb >= 0:
logger.info(
"TTS model loaded (rss_after=%.1f MB, delta=%.1f MB)",
after_mb,
after_mb - before_mb,
)
self._last_used = time.time()
async def unload(self) -> None:
async with self._lock:
if not self._service:
return
if self._active > 0:
logger.debug("Skip unload: %d active operations", self._active)
return
if self._service.model is not None:
before_mb = _rss_mb()
logger.info(
"Unloading idle TTS model... (rss_before=%.1f MB, active=%d)",
before_mb,
self._active,
)
self._service.unload_model()
# Drop the service instance as well to release any lingering refs
self._service = None
# Force GC and attempt allocator cache cleanup
try:
gc.collect()
finally:
if torch is not None:
try:
if hasattr(torch, "cuda") and torch.cuda.is_available():
torch.cuda.empty_cache()
except Exception:
logger.debug("cuda.empty_cache() failed", exc_info=True)
try:
# MPS empty_cache may exist depending on torch version
mps = getattr(torch, "mps", None)
if mps is not None and hasattr(mps, "empty_cache"):
mps.empty_cache()
except Exception:
logger.debug("mps.empty_cache() failed", exc_info=True)
after_mb = _rss_mb()
if after_mb >= 0 and before_mb >= 0:
logger.info(
"Idle unload complete (rss_after=%.1f MB, delta=%.1f MB)",
after_mb,
after_mb - before_mb,
)
self._last_used = time.time()
async def get_service(self) -> TTSService:
if not self._service or self._service.model is None:
await self.load()
self._last_used = time.time()
return self._service # type: ignore[return-value]
async def _inc(self) -> None:
async with self._counter_lock:
self._active += 1
async def _dec(self) -> None:
async with self._counter_lock:
self._active = max(0, self._active - 1)
self._last_used = time.time()
def last_used(self) -> float:
return self._last_used
def is_loaded(self) -> bool:
return bool(self._service and self._service.model is not None)
def active(self) -> int:
return self._active
def using(self):
manager = self
class _Ctx:
async def __aenter__(self):
await manager._inc()
return manager
async def __aexit__(self, exc_type, exc, tb):
await manager._dec()
return _Ctx()
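A hedged sketch of how a caller is expected to pair get_service() with using(): while a request holds the context manager, _active stays above zero and the idle reaper's unload() declines to evict. The route path and module path here are assumptions, not part of this diff:

from fastapi import APIRouter
from app.services.model_manager import ModelManager  # assumed module path

router = APIRouter()

@router.post("/generate_line")  # illustrative route
async def generate_line(payload: dict):
    manager = ModelManager.instance()
    async with manager.using():  # increments _active so unload() skips eviction
        tts = await manager.get_service()  # lazy-loads the model on first use
        out_path = await tts.generate_speech(
            text=payload["text"],
            speaker_sample_path=payload["speaker_sample_path"],
            output_filename_base="line_segment",
        )
    return {"audio_path": str(out_path)}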

152
backend/app/services/speaker_service.py Normal file
View File

@ -0,0 +1,152 @@
import yaml
import uuid
import os
import io # Added for BytesIO
import torchaudio # Added for audio processing
from pathlib import Path
from typing import List, Dict, Optional, Any
from fastapi import UploadFile, HTTPException
try:
from app.models.speaker_models import Speaker, SpeakerCreate
from app import config
except ModuleNotFoundError:
# When imported from scripts at project root
from backend.app.models.speaker_models import Speaker, SpeakerCreate
from backend.app import config
class SpeakerManagementService:
def __init__(self):
self._ensure_data_files_exist()
self.speakers_data = self._load_speakers_data()
def _ensure_data_files_exist(self):
"""Ensures the speaker data directory and YAML file exist."""
config.SPEAKER_DATA_BASE_DIR.mkdir(parents=True, exist_ok=True)
config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
if not config.SPEAKERS_YAML_FILE.exists():
with open(config.SPEAKERS_YAML_FILE, 'w') as f:
yaml.dump({}, f) # Initialize with an empty dict, as per previous fixes
def _load_speakers_data(self) -> Dict[str, Any]: # Changed return type to Dict
"""Loads speaker data from the YAML file."""
try:
with open(config.SPEAKERS_YAML_FILE, 'r') as f:
data = yaml.safe_load(f)
return data if isinstance(data, dict) else {} # Ensure it's a dict
except FileNotFoundError:
return {}
except yaml.YAMLError:
# Handle a corrupted YAML file: log the error and fall back to an empty dict
print(f"Error: Corrupted speakers YAML file at {config.SPEAKERS_YAML_FILE}")
return {}
def _save_speakers_data(self):
"""Saves the current speaker data to the YAML file."""
with open(config.SPEAKERS_YAML_FILE, 'w') as f:
yaml.dump(self.speakers_data, f, sort_keys=False)
def get_speakers(self) -> List[Speaker]:
"""Returns a list of all speakers."""
# self.speakers_data is now a dict: {speaker_id: {name: ..., sample_path: ...}}
return [Speaker(id=spk_id, **spk_attrs) for spk_id, spk_attrs in self.speakers_data.items()]
def get_speaker_by_id(self, speaker_id: str) -> Optional[Speaker]:
"""Retrieves a speaker by their ID."""
if speaker_id in self.speakers_data:
speaker_attributes = self.speakers_data[speaker_id]
return Speaker(id=speaker_id, **speaker_attributes)
return None
async def add_speaker(self, name: str, audio_file: UploadFile) -> Speaker:
"""Adds a new speaker, converts sample to WAV, saves it, and updates YAML."""
speaker_id = str(uuid.uuid4())
# Define standardized sample filename and path (always WAV)
sample_filename = f"{speaker_id}.wav"
sample_path = config.SPEAKER_SAMPLES_DIR / sample_filename
try:
content = await audio_file.read()
# Use BytesIO to handle the in-memory audio data for torchaudio
audio_buffer = io.BytesIO(content)
# Load audio data using torchaudio, this handles various formats (MP3, WAV, etc.)
# waveform is a tensor, sample_rate is an int
waveform, sample_rate = torchaudio.load(audio_buffer)
# Save the audio data as WAV
# Ensure the SPEAKER_SAMPLES_DIR exists (though _ensure_data_files_exist should handle it)
config.SPEAKER_SAMPLES_DIR.mkdir(parents=True, exist_ok=True)
torchaudio.save(str(sample_path), waveform, sample_rate, format="wav")
except (RuntimeError, OSError) as e:
# torchaudio raises RuntimeError/OSError for unsupported or corrupted audio (it defines no TorchaudioException)
raise HTTPException(status_code=400, detail=f"Error processing audio file: {e}. Ensure it's a valid audio format (e.g., WAV, MP3).")
except Exception as e:
# General error handling for other issues (e.g., file system errors)
raise HTTPException(status_code=500, detail=f"Could not save audio file: {e}")
finally:
await audio_file.close()
# self.speakers_data is a dict keyed by speaker_id; the sample path is stored relative to SPEAKER_DATA_BASE_DIR
self.speakers_data[speaker_id] = {
"name": name,
"sample_path": str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR))
}
self._save_speakers_data()
# Construct Speaker model for return, including the ID
return Speaker(id=speaker_id, name=name, sample_path=str(sample_path.relative_to(config.SPEAKER_DATA_BASE_DIR)))
def delete_speaker(self, speaker_id: str) -> bool:
"""Deletes a speaker and their audio sample."""
# Speaker data is now a dictionary, keyed by speaker_id
speaker_to_delete = self.speakers_data.pop(speaker_id, None)
if speaker_to_delete:
self._save_speakers_data()
sample_path_str = speaker_to_delete.get("sample_path")
if sample_path_str:
# sample_path_str is relative to SPEAKER_DATA_BASE_DIR
full_sample_path = config.SPEAKER_DATA_BASE_DIR / sample_path_str
try:
if full_sample_path.is_file(): # Check if it's a file before removing
os.remove(full_sample_path)
except OSError as e:
# Log error if file deletion fails but proceed
print(f"Error deleting sample file {full_sample_path}: {e}")
return True
return False
# Example usage (for testing, not part of the service itself)
if __name__ == "__main__":
service = SpeakerManagementService()
print("Initial speakers:", service.get_speakers())
# This part would require a mock UploadFile to run directly
# print("\nAdding a new speaker (manual test setup needed for UploadFile)")
# class MockUploadFile:
# def __init__(self, filename, content):
# self.filename = filename
# self._content = content
# async def read(self): return self._content
# async def close(self): pass
# import asyncio
# async def test_add():
# mock_file = MockUploadFile("test.wav", b"dummy audio content")
# new_speaker = await service.add_speaker(name="Test Speaker", audio_file=mock_file)
# print("\nAdded speaker:", new_speaker)
# print("Speakers after add:", service.get_speakers())
# return new_speaker.id
# speaker_id_to_delete = asyncio.run(test_add())
# if speaker_id_to_delete:
# print(f"\nDeleting speaker {speaker_id_to_delete}")
# service.delete_speaker(speaker_id_to_delete)
# print("Speakers after delete:", service.get_speakers())

220
backend/app/services/tts_service.py Normal file
View File

@ -0,0 +1,220 @@
import torch
import torchaudio
import asyncio
from typing import Optional
from chatterbox.tts import ChatterboxTTS
from pathlib import Path
import gc # Garbage collector for memory management
import os
from contextlib import contextmanager
from datetime import datetime
import time
# Import configuration
try:
from app.config import TTS_TEMP_OUTPUT_DIR, SPEAKER_SAMPLES_DIR
except ModuleNotFoundError:
# When imported from scripts at project root
from backend.app.config import TTS_TEMP_OUTPUT_DIR, SPEAKER_SAMPLES_DIR
# Use configuration for TTS output directory
TTS_OUTPUT_DIR = TTS_TEMP_OUTPUT_DIR
def safe_load_chatterbox_tts(device):
"""
Safely load ChatterboxTTS model with device mapping to handle CUDA->MPS/CPU conversion.
This patches torch.load temporarily to map CUDA tensors to the appropriate device.
"""
@contextmanager
def patch_torch_load(target_device):
original_load = torch.load
def patched_load(*args, **kwargs):
# Add map_location to handle device mapping
if 'map_location' not in kwargs:
if target_device == "mps" and torch.backends.mps.is_available():
kwargs['map_location'] = torch.device('mps')
else:
kwargs['map_location'] = torch.device('cpu')
return original_load(*args, **kwargs)
torch.load = patched_load
try:
yield
finally:
torch.load = original_load
with patch_torch_load(device):
return ChatterboxTTS.from_pretrained(device=device)
class TTSService:
def __init__(self, device: str = "mps"): # Default to MPS for Macs, can be "cpu" or "cuda"
self.device = device
self.model = None
self._ensure_output_dir_exists()
def _ensure_output_dir_exists(self):
"""Ensures the TTS output directory exists."""
TTS_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
def load_model(self):
"""Loads the ChatterboxTTS model."""
if self.model is None:
print(f"Loading ChatterboxTTS model to device: {self.device}...")
try:
self.model = safe_load_chatterbox_tts(self.device)
print("ChatterboxTTS model loaded successfully.")
except Exception as e:
print(f"Error loading ChatterboxTTS model: {e}")
# Potentially raise an exception or handle appropriately
raise
else:
print("ChatterboxTTS model already loaded.")
def unload_model(self):
"""Unloads the model and clears memory."""
if self.model is not None:
print("Unloading ChatterboxTTS model and clearing cache...")
del self.model
self.model = None
if self.device == "cuda":
torch.cuda.empty_cache()
elif self.device == "mps":
if hasattr(torch.mps, "empty_cache"): # Check if empty_cache is available for MPS
torch.mps.empty_cache()
gc.collect() # Explicitly run garbage collection
print("Model unloaded and memory cleared.")
async def generate_speech(
self,
text: str,
speaker_sample_path: str, # Absolute path to the speaker's audio sample
output_filename_base: str, # e.g., "dialog_line_1_spk_X_chunk_0"
speaker_id: Optional[str] = None, # Optional, mainly for logging if needed, filename base is primary
output_dir: Optional[Path] = None, # Optional, defaults to TTS_OUTPUT_DIR from this module
exaggeration: float = 0.5, # Default from Gradio
cfg_weight: float = 0.5, # Default from Gradio
temperature: float = 0.8, # Default from Gradio
unload_after: bool = False, # Whether to unload the model after generation
) -> Path:
"""
Generates speech from text using the loaded TTS model and a speaker sample.
Saves the output to a .wav file.
"""
if self.model is None:
self.load_model()
if self.model is None: # Check again if loading failed
raise RuntimeError("TTS model is not loaded. Cannot generate speech.")
# Ensure speaker_sample_path is valid
speaker_sample_p = Path(speaker_sample_path)
if not speaker_sample_p.exists() or not speaker_sample_p.is_file():
raise FileNotFoundError(f"Speaker sample audio file not found: {speaker_sample_path}")
target_output_dir = output_dir if output_dir is not None else TTS_OUTPUT_DIR
target_output_dir.mkdir(parents=True, exist_ok=True)
# output_filename_base from DialogProcessorService is expected to be comprehensive (e.g., includes speaker_id, segment info)
output_file_path = target_output_dir / f"{output_filename_base}.wav"
start_ts = datetime.now()
print(f"[{start_ts.isoformat(timespec='seconds')}] [TTS] START generate+save base={output_filename_base} len={len(text)} sample={speaker_sample_path}")
try:
def _gen_and_save() -> Path:
t0 = time.perf_counter()
wav = None
try:
with torch.no_grad(): # Important for inference
wav = self.model.generate(
text=text,
audio_prompt_path=str(speaker_sample_p), # Must be a string path
exaggeration=exaggeration,
cfg_weight=cfg_weight,
temperature=temperature,
)
# Save the audio synchronously in the same thread
torchaudio.save(str(output_file_path), wav, self.model.sr)
t1 = time.perf_counter()
print(f"[TTS-THREAD] Saved {output_file_path.name} in {t1 - t0:.2f}s")
return output_file_path
finally:
# Cleanup in the same thread that created the tensor
if wav is not None:
del wav
gc.collect()
if self.device == "cuda":
torch.cuda.empty_cache()
elif self.device == "mps":
if hasattr(torch.mps, "empty_cache"):
torch.mps.empty_cache()
out_path = await asyncio.to_thread(_gen_and_save)
end_ts = datetime.now()
print(f"[{end_ts.isoformat(timespec='seconds')}] [TTS] END generate+save base={output_filename_base} dur={(end_ts - start_ts).total_seconds():.2f}s -> {out_path}")
# Optionally unload model after generation
if unload_after:
print("Unloading TTS model after generation...")
self.unload_model()
return out_path
except Exception as e:
print(f"Error during TTS generation or saving: {e}")
raise
# Example usage (for testing, not part of the service itself)
if __name__ == "__main__":
async def main_test():
tts_service = TTSService(device="mps")
try:
tts_service.load_model()
dummy_speaker_root = SPEAKER_SAMPLES_DIR
dummy_speaker_root.mkdir(parents=True, exist_ok=True)
dummy_sample_file = dummy_speaker_root / "dummy_speaker_test.wav"
# Always try to remove an existing dummy file to ensure a fresh one is created
if dummy_sample_file.exists():
try:
os.remove(dummy_sample_file)
print(f"Removed existing dummy sample: {dummy_sample_file}")
except OSError as e:
print(f"Error removing existing dummy sample {dummy_sample_file}: {e}")
# Proceeding, but torchaudio.save might fail or overwrite
print(f"Creating new dummy speaker sample: {dummy_sample_file}")
# Create a minimal, silent WAV file for testing
sample_rate = 22050
duration = 1 # seconds
num_channels = 1
num_frames = sample_rate * duration
audio_data = torch.zeros((num_channels, num_frames))
try:
torchaudio.save(str(dummy_sample_file), audio_data, sample_rate)
print(f"Dummy sample created successfully: {dummy_sample_file}")
except Exception as save_e:
print(f"Could not create dummy sample: {save_e}")
# If creation fails, the subsequent generation test will likely also fail or be skipped.
if dummy_sample_file.exists():
output_path = await tts_service.generate_speech(
text="Hello, this is a test of the Text-to-Speech service.",
speaker_id="test_speaker",
speaker_sample_path=str(dummy_sample_file),
output_filename_base="test_generation"
)
print(f"Test generation output: {output_path}")
else:
print(f"Skipping generation test as dummy sample {dummy_sample_file} not found.")
except Exception as e:
import traceback
print(f"Error during TTS generation or saving:")
traceback.print_exc()
finally:
tts_service.unload_model()
asyncio.run(main_test())
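A minimal smoke-test sketch of the service API above; the import path matches its use elsewhere in this diff, and the sample path is an assumption (any readable WAV works):

import asyncio
from app.services.tts_service import TTSService  # path as used elsewhere in this diff

async def smoke_test() -> None:
    svc = TTSService(device="cpu")  # "mps"/"cuda" also valid per the constructor
    try:
        out = await svc.generate_speech(
            text="Quick smoke test of the service.",
            speaker_sample_path="speaker_data/speaker_samples/dummy_speaker_test.wav",  # assumed location
            output_filename_base="smoke_test",
        )
        print(f"Wrote {out}")
    finally:
        svc.unload_model()  # release model memory when done

asyncio.run(smoke_test())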

8
backend/requirements.txt Normal file
View File

@ -0,0 +1,8 @@
fastapi
uvicorn[standard]
python-multipart
PyYAML
torch
torchaudio
chatterbox-tts
python-dotenv
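The dependencies are left unpinned; a plain pip install -r backend/requirements.txt in a fresh virtual environment is enough to install them.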

152
backend/run_api_test.py Normal file
View File

@ -0,0 +1,152 @@
import requests
import json
from pathlib import Path
import time
# Configuration
API_BASE_URL = "http://localhost:8000/api/dialog"
ENDPOINT_URL = f"{API_BASE_URL}/generate"
# Define project root relative to this test script (assuming it's in backend/)
PROJECT_ROOT = Path(__file__).resolve().parent
GENERATED_DIALOGS_DIR = PROJECT_ROOT / "tts_generated_dialogs"
DIALOG_PAYLOAD = {
"output_base_name": "test_dialog_from_script",
"dialog_items": [
{
"type": "speech",
"speaker_id": "90fcd672-ba84-441a-ac6c-0449a59653bd", # Correct UUID for dummy_speaker
"text": "This is a test from the Python script. One, two, three.",
"exaggeration": 1.5,
"cfg_weight": 4.0,
"temperature": 0.5
},
{
"type": "silence",
"duration": 0.5
},
{
"type": "speech",
"speaker_id": "90fcd672-ba84-441a-ac6c-0449a59653bd",
"text": "Testing complete. All systems nominal."
},
{
"type": "speech",
"speaker_id": "non_existent_speaker", # Test case for invalid speaker
"text": "This should produce an error for this segment."
},
{
"type": "silence",
"duration": 0.25 # Changed to valid duration
}
]
}
def run_test():
print(f"Sending POST request to: {ENDPOINT_URL}")
print("Payload:")
print(json.dumps(DIALOG_PAYLOAD, indent=2))
print("-" * 50)
try:
start_time = time.time()
response = requests.post(ENDPOINT_URL, json=DIALOG_PAYLOAD, timeout=120) # Increased timeout for TTS processing
end_time = time.time()
print(f"Response received in {end_time - start_time:.2f} seconds.")
print(f"Status Code: {response.status_code}")
print("-" * 50)
if response.content:
try:
response_data = response.json()
print("Response JSON:")
print(json.dumps(response_data, indent=2))
print("-" * 50)
if response.status_code == 200:
print("Test PASSED (HTTP 200 OK)")
concatenated_url = response_data.get("concatenated_audio_url")
zip_url = response_data.get("zip_archive_url")
temp_dir = response_data.get("temp_dir_path")
if concatenated_url:
print(f"Concatenated audio URL: http://localhost:8000{concatenated_url}")
if zip_url:
print(f"ZIP archive URL: http://localhost:8000{zip_url}")
if temp_dir:
print(f"Temporary segment directory: {temp_dir}")
print("\nTo verify, check the generated files in:")
print(f" Concatenated/ZIP: {GENERATED_DIALOGS_DIR}")
print(f" Individual segments (if not cleaned up): {temp_dir}")
else:
print(f"Test FAILED (HTTP {response.status_code})")
if response_data.get("detail"):
print(f"Error Detail: {response_data.get('detail')}")
except json.JSONDecodeError:
print("Response content is not valid JSON:")
print(response.text)
print("Test FAILED (Invalid JSON Response)")
else:
print("Response content is empty.")
print(f"Test FAILED (Empty Response, HTTP {response.status_code})")
except requests.exceptions.ConnectionError as e:
print(f"Connection Error: {e}")
print("Test FAILED (Could not connect to the server. Is it running?)")
except requests.exceptions.Timeout as e:
print(f"Request Timeout: {e}")
print("Test FAILED (The request timed out. TTS processing might be too slow or stuck.)")
except Exception as e:
print(f"An unexpected error occurred: {e}")
print("Test FAILED (Unexpected error)")
def test_generate_line_speech():
url = f"{API_BASE_URL}/generate_line"
payload = {
"type": "speech",
"speaker_id": "90fcd672-ba84-441a-ac6c-0449a59653bd", # Correct UUID for dummy_speaker
"text": "This is a per-line TTS test.",
"exaggeration": 1.0,
"cfg_weight": 2.0,
"temperature": 0.8
}
print(f"\nTesting /generate_line with speech item: {payload}")
response = requests.post(url, json=payload)
print(f"Status: {response.status_code}")
try:
data = response.json()
print(f"Response: {json.dumps(data, indent=2)}")
if response.status_code == 200 and "audio_url" in data:
print("Speech line test PASSED.")
else:
print("Speech line test FAILED.")
except Exception as e:
print(f"Speech line test FAILED: {e}")
def test_generate_line_silence():
url = f"{API_BASE_URL}/generate_line"
payload = {
"type": "silence",
"duration": 1.25
}
print(f"\nTesting /generate_line with silence item: {payload}")
response = requests.post(url, json=payload)
print(f"Status: {response.status_code}")
try:
data = response.json()
print(f"Response: {json.dumps(data, indent=2)}")
if response.status_code == 200 and "audio_url" in data:
print("Silence line test PASSED.")
else:
print("Silence line test FAILED.")
except Exception as e:
print(f"Silence line test FAILED: {e}")
if __name__ == "__main__":
run_test()
test_generate_line_speech()
test_generate_line_silence()
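The same endpoint can also be probed in isolation; a minimal happy-path check, assuming the server is running locally:

import requests

resp = requests.post(
    "http://localhost:8000/api/dialog/generate_line",
    json={"type": "silence", "duration": 0.5},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the tests above expect an "audio_url" key on success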

31
backend/start_server.py Normal file
View File

@ -0,0 +1,31 @@
#!/usr/bin/env python3
"""
Backend server startup script that uses environment variables from config.
"""
import uvicorn
from app import config
if __name__ == "__main__":
print(f"Starting Chatterbox TTS Backend Server...")
print(f"Host: {config.HOST}")
print(f"Port: {config.PORT}")
print(f"Reload: {config.RELOAD}")
print(f"CORS Origins: {config.CORS_ORIGINS}")
print(f"Project Root: {config.PROJECT_ROOT}")
print(f"Device: {config.DEVICE}")
# Idle eviction settings
print(
"Model Eviction -> enabled: {} | idle_timeout: {}s | check_interval: {}s".format(
getattr(config, "MODEL_EVICTION_ENABLED", True),
getattr(config, "MODEL_IDLE_TIMEOUT_SECONDS", 0),
getattr(config, "MODEL_IDLE_CHECK_INTERVAL_SECONDS", 60),
)
)
uvicorn.run(
"app.main:app",
host=config.HOST,
port=config.PORT,
reload=config.RELOAD
)
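app/config.py itself does not appear in this section; a hypothetical sketch of the values start_server.py reads, with names taken from the prints above and defaults that are illustrative only:

import os
from dotenv import load_dotenv  # python-dotenv is in backend/requirements.txt

load_dotenv()
HOST = os.getenv("HOST", "127.0.0.1")
PORT = int(os.getenv("PORT", "8000"))
RELOAD = os.getenv("RELOAD", "false").lower() == "true"
DEVICE = os.getenv("DEVICE", "auto")
MODEL_EVICTION_ENABLED = os.getenv("MODEL_EVICTION_ENABLED", "true").lower() == "true"
MODEL_IDLE_TIMEOUT_SECONDS = int(os.getenv("MODEL_IDLE_TIMEOUT_SECONDS", "0"))
MODEL_IDLE_CHECK_INTERVAL_SECONDS = int(os.getenv("MODEL_IDLE_CHECK_INTERVAL_SECONDS", "60"))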

496
cbx-audiobook.py Executable file
View File

@ -0,0 +1,496 @@
#!/usr/bin/env python
"""
Chatterbox Audiobook Generator
This script converts a text file into an audiobook using the Chatterbox TTS system.
It parses the text file into manageable chunks, generates audio for each chunk,
and assembles them into a complete audiobook.
"""
import argparse
import asyncio
import gc
import os
import re
import subprocess
import sys
import torch
from pathlib import Path
import uuid
# Import helper to fix Python path
import import_helper
# Import backend services
from backend.app.services.tts_service import TTSService
from backend.app.services.speaker_service import SpeakerManagementService
from backend.app.services.audio_manipulation_service import AudioManipulationService
from backend.app.config import DIALOG_GENERATED_DIR, TTS_TEMP_OUTPUT_DIR
class AudiobookGenerator:
def __init__(self, speaker_id, output_base_name, device="mps",
exaggeration=0.5, cfg_weight=0.5, temperature=0.8,
pause_between_sentences=0.5, pause_between_paragraphs=1.0,
keep_model_loaded=False, cleanup_interval=10, use_subprocess=False):
"""
Initialize the audiobook generator.
Args:
speaker_id: ID of the speaker to use
output_base_name: Base name for output files
device: Device to use for TTS (mps, cuda, cpu)
exaggeration: Controls expressiveness (0.0-1.0)
cfg_weight: Controls alignment with speaker characteristics (0.0-1.0)
temperature: Controls randomness in generation (0.0-1.0)
pause_between_sentences: Pause duration between sentences in seconds
pause_between_paragraphs: Pause duration between paragraphs in seconds
keep_model_loaded: If True, keeps model loaded across chunks (more efficient but uses more memory)
cleanup_interval: How often to perform deep cleanup when keep_model_loaded=True
use_subprocess: If True, uses separate processes for each chunk (slower but guarantees memory release)
"""
self.speaker_id = speaker_id
self.output_base_name = output_base_name
self.device = device
self.exaggeration = exaggeration
self.cfg_weight = cfg_weight
self.temperature = temperature
self.pause_between_sentences = pause_between_sentences
self.pause_between_paragraphs = pause_between_paragraphs
self.keep_model_loaded = keep_model_loaded
self.cleanup_interval = cleanup_interval
self.use_subprocess = use_subprocess
self.chunk_counter = 0
# Initialize services
self.tts_service = TTSService(device=device)
self.speaker_service = SpeakerManagementService()
self.audio_manipulator = AudioManipulationService()
# Create output directories
self.output_dir = DIALOG_GENERATED_DIR / output_base_name
self.output_dir.mkdir(parents=True, exist_ok=True)
self.temp_dir = TTS_TEMP_OUTPUT_DIR / output_base_name
self.temp_dir.mkdir(parents=True, exist_ok=True)
# Validate speaker
self._validate_speaker()
def _validate_speaker(self):
"""Validate that the specified speaker exists."""
speaker_info = self.speaker_service.get_speaker_by_id(self.speaker_id)
if not speaker_info:
raise ValueError(f"Speaker ID '{self.speaker_id}' not found.")
if not speaker_info.sample_path:
raise ValueError(f"Speaker ID '{self.speaker_id}' has no sample path defined.")
# Store speaker info for later use
self.speaker_info = speaker_info
def _cleanup_memory(self):
"""Force memory cleanup and garbage collection."""
print("Performing memory cleanup...")
# Force garbage collection multiple times for thorough cleanup
for _ in range(3):
gc.collect()
# Clear device-specific caches
if self.device == "cuda" and torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
# Additional CUDA cleanup
try:
torch.cuda.reset_peak_memory_stats()
except Exception:
pass
elif self.device == "mps" and torch.backends.mps.is_available():
if hasattr(torch.mps, "empty_cache"):
torch.mps.empty_cache()
if hasattr(torch.mps, "synchronize"):
torch.mps.synchronize()
# Try to free MPS memory more aggressively (these torch.mps attributes only exist on newer torch builds)
try:
if hasattr(torch.mps, "current_allocated_memory") and torch.mps.current_allocated_memory() > 0:
torch.mps.empty_cache()
except Exception:
pass
# Additional aggressive cleanup
if hasattr(torch, '_C') and hasattr(torch._C, '_cuda_clearCublasWorkspaces'):
try:
torch._C._cuda_clearCublasWorkspaces()
except Exception:
pass
print("Memory cleanup completed.")
async def _generate_chunk_subprocess(self, chunk, segment_filename_base, speaker_sample_path):
"""
Generate a single chunk using cbx-generate.py in a subprocess.
This guarantees memory is released when the process exits.
"""
output_file = self.temp_dir / f"{segment_filename_base}.wav"
# Use cbx-generate.py for single chunk generation
cmd = [
sys.executable, "cbx-generate.py",
"--sample", str(speaker_sample_path),
"--output", str(output_file),
"--text", chunk,
"--device", self.device
]
print(f"Running subprocess: {' '.join(cmd[:4])} ... (text truncated)")
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=300, # 5 minute timeout per chunk
cwd=Path(__file__).parent # Run from project root
)
if result.returncode != 0:
raise RuntimeError(f"Subprocess failed: {result.stderr}")
if not output_file.exists():
raise RuntimeError(f"Output file not created: {output_file}")
print(f"Subprocess completed successfully: {output_file}")
return output_file
except subprocess.TimeoutExpired:
raise RuntimeError(f"Subprocess timed out after 5 minutes")
except Exception as e:
raise RuntimeError(f"Subprocess error: {e}")
def split_text_into_chunks(self, text, max_length=300):
"""
Split text into chunks suitable for TTS processing.
This uses the same logic as the DialogProcessorService._split_text method
but adds additional paragraph handling.
"""
# Split text into paragraphs first
paragraphs = re.split(r'\n\s*\n', text)
paragraphs = [p.strip() for p in paragraphs if p.strip()]
all_chunks = []
for paragraph in paragraphs:
# Split paragraph into sentences
sentences = re.split(r'(?<=[.!?\u2026])\s+|(?<=[.!?\u2026])(?=[\"\')\]\}\u201d\u2019])|(?<=[.!?\u2026])$', paragraph.strip())
sentences = [s.strip() for s in sentences if s and s.strip()]
chunks = []
current_chunk = ""
for sentence in sentences:
if not sentence:
continue
if not current_chunk: # First sentence for this chunk
current_chunk = sentence
elif len(current_chunk) + len(sentence) + 1 <= max_length:
current_chunk += " " + sentence
else:
chunks.append(current_chunk)
current_chunk = sentence
if current_chunk: # Add the last chunk
chunks.append(current_chunk)
# Further split any chunks that are still too long
paragraph_chunks = []
for chunk in chunks:
if len(chunk) > max_length:
# Simple split by length if a sentence itself is too long
for i in range(0, len(chunk), max_length):
paragraph_chunks.append(chunk[i:i+max_length])
else:
paragraph_chunks.append(chunk)
# Add paragraph marker
if paragraph_chunks:
all_chunks.append({"type": "paragraph", "chunks": paragraph_chunks})
return all_chunks
async def generate_audiobook(self, text_file_path):
"""
Generate an audiobook from a text file.
Args:
text_file_path: Path to the text file to convert
Returns:
Path to the generated audiobook file
"""
# Read the text file
text_path = Path(text_file_path)
if not text_path.exists():
raise FileNotFoundError(f"Text file not found: {text_file_path}")
with open(text_path, 'r', encoding='utf-8') as f:
text = f.read()
print(f"Processing text file: {text_file_path}")
print(f"Text length: {len(text)} characters")
# Split text into chunks
paragraphs = self.split_text_into_chunks(text)
total_chunks = sum(len(p["chunks"]) for p in paragraphs)
print(f"Split into {len(paragraphs)} paragraphs with {total_chunks} total chunks")
# Generate audio for each chunk
segment_results = []
chunk_count = 0
# Pre-load model if keeping it loaded
if self.keep_model_loaded:
print("Pre-loading TTS model for batch processing...")
self.tts_service.load_model()
try:
for para_idx, paragraph in enumerate(paragraphs):
print(f"Processing paragraph {para_idx+1}/{len(paragraphs)}")
for chunk_idx, chunk in enumerate(paragraph["chunks"]):
chunk_count += 1
self.chunk_counter += 1
print(f" Generating audio for chunk {chunk_count}/{total_chunks}: {chunk[:50]}...")
# Generate unique filename for this chunk
segment_filename_base = f"{self.output_base_name}_p{para_idx}_c{chunk_idx}_{uuid.uuid4().hex[:8]}"
try:
# Get absolute speaker sample path
speaker_sample_path = Path(self.speaker_info.sample_path)
if not speaker_sample_path.is_absolute():
from backend.app.config import SPEAKER_DATA_BASE_DIR
speaker_sample_path = SPEAKER_DATA_BASE_DIR / speaker_sample_path
# Generate speech for this chunk
if self.use_subprocess:
# Use subprocess for guaranteed memory release
segment_output_path = await self._generate_chunk_subprocess(
chunk=chunk,
segment_filename_base=segment_filename_base,
speaker_sample_path=speaker_sample_path
)
else:
# Load model for this chunk (if not keeping loaded)
if not self.keep_model_loaded:
print("Loading TTS model...")
self.tts_service.load_model()
# Generate speech using the TTS service
segment_output_path = await self.tts_service.generate_speech(
text=chunk,
speaker_id=self.speaker_id,
speaker_sample_path=str(speaker_sample_path),
output_filename_base=segment_filename_base,
output_dir=self.temp_dir,
exaggeration=self.exaggeration,
cfg_weight=self.cfg_weight,
temperature=self.temperature
)
# Memory management strategy based on model lifecycle
if self.use_subprocess:
# No memory management needed - subprocess handles it
pass
elif self.keep_model_loaded:
# Light cleanup after each chunk
if self.chunk_counter % self.cleanup_interval == 0:
print(f"Performing periodic deep cleanup (chunk {self.chunk_counter})")
self._cleanup_memory()
else:
# Explicit memory cleanup after generation
self._cleanup_memory()
# Unload model after generation
print("Unloading TTS model...")
self.tts_service.unload_model()
# Additional memory cleanup after model unload
self._cleanup_memory()
# Add to segment results
segment_results.append({
"type": "speech",
"path": str(segment_output_path)
})
# Add pause between sentences
if chunk_idx < len(paragraph["chunks"]) - 1:
segment_results.append({
"type": "silence",
"duration": self.pause_between_sentences
})
except Exception as e:
print(f"Error generating speech for chunk: {e}")
# Ensure model is unloaded if there was an error and not using subprocess
if not self.use_subprocess:
if not self.keep_model_loaded and self.tts_service.model is not None:
print("Unloading TTS model after error...")
self.tts_service.unload_model()
# Force cleanup after error
self._cleanup_memory()
# Continue with next chunk
# Add longer pause between paragraphs
if para_idx < len(paragraphs) - 1:
segment_results.append({
"type": "silence",
"duration": self.pause_between_paragraphs
})
finally:
# Always unload model at the end if it was kept loaded
if self.keep_model_loaded and self.tts_service.model is not None:
print("Final cleanup: Unloading TTS model...")
self.tts_service.unload_model()
self._cleanup_memory()
# Concatenate all segments
print("Concatenating audio segments...")
concatenated_filename = f"{self.output_base_name}_audiobook.wav"
concatenated_path = self.output_dir / concatenated_filename
self.audio_manipulator.concatenate_audio_segments(
segment_results=segment_results,
output_concatenated_path=concatenated_path
)
# Create ZIP archive with all files
print("Creating ZIP archive...")
zip_filename = f"{self.output_base_name}_audiobook.zip"
zip_path = self.output_dir / zip_filename
# Collect all speech segment files
speech_segment_paths = [
Path(s["path"]) for s in segment_results
if s["type"] == "speech" and Path(s["path"]).exists()
]
self.audio_manipulator.create_zip_archive(
segment_file_paths=speech_segment_paths,
concatenated_audio_path=concatenated_path,
output_zip_path=zip_path
)
print(f"Audiobook generation complete!")
print(f"Audiobook file: {concatenated_path}")
print(f"ZIP archive: {zip_path}")
# Ensure model is unloaded at the end (just in case)
if self.tts_service.model is not None:
print("Final check: Unloading TTS model...")
self.tts_service.unload_model()
return concatenated_path
async def main():
parser = argparse.ArgumentParser(description="Generate an audiobook from a text file using Chatterbox TTS")
# Create a mutually exclusive group for the main operation vs listing speakers
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--list-speakers", action="store_true", help="List available speakers and exit")
group.add_argument("text_file", nargs="?", help="Path to the text file to convert")
# Other arguments
parser.add_argument("--speaker", "-s", help="ID of the speaker to use")
parser.add_argument("--output", "-o", help="Base name for output files (default: derived from text filename)")
parser.add_argument("--device", default="mps", choices=["mps", "cuda", "cpu"], help="Device to use for TTS (default: mps)")
parser.add_argument("--exaggeration", type=float, default=0.5, help="Controls expressiveness (0.0-1.0, default: 0.5)")
parser.add_argument("--cfg-weight", type=float, default=0.5, help="Controls alignment with speaker (0.0-1.0, default: 0.5)")
parser.add_argument("--temperature", type=float, default=0.8, help="Controls randomness (0.0-1.0, default: 0.8)")
parser.add_argument("--sentence-pause", type=float, default=0.5, help="Pause between sentences in seconds (default: 0.5)")
parser.add_argument("--paragraph-pause", type=float, default=1.0, help="Pause between paragraphs in seconds (default: 1.0)")
parser.add_argument("--keep-model-loaded", action="store_true", help="Keep model loaded between chunks (faster but uses more memory)")
parser.add_argument("--cleanup-interval", type=int, default=10, help="How often to perform deep cleanup when keeping model loaded (default: 10)")
parser.add_argument("--force-cpu-on-oom", action="store_true", help="Automatically switch to CPU if MPS/CUDA runs out of memory")
parser.add_argument("--max-chunk-length", type=int, default=300, help="Maximum chunk length for text splitting (default: 300)")
parser.add_argument("--use-subprocess", action="store_true", help="Use separate processes for each chunk (guarantees memory release but slower)")
args = parser.parse_args()
# List speakers if requested
if args.list_speakers:
speaker_service = SpeakerManagementService()
speakers = speaker_service.get_speakers()
print("Available speakers:")
for speaker in speakers:
print(f" {speaker.id}: {speaker.name}")
return
# Validate required arguments for audiobook generation
if not args.text_file:
parser.error("text_file is required when not using --list-speakers")
if not args.speaker:
parser.error("--speaker/-s is required when not using --list-speakers")
# Determine output base name if not provided
if not args.output:
text_path = Path(args.text_file)
args.output = text_path.stem
try:
# Create audiobook generator
generator = AudiobookGenerator(
speaker_id=args.speaker,
output_base_name=args.output,
device=args.device,
exaggeration=args.exaggeration,
cfg_weight=args.cfg_weight,
temperature=args.temperature,
pause_between_sentences=args.sentence_pause,
pause_between_paragraphs=args.paragraph_pause,
keep_model_loaded=args.keep_model_loaded,
cleanup_interval=args.cleanup_interval,
use_subprocess=args.use_subprocess
)
# Generate audiobook with automatic fallback
try:
await generator.generate_audiobook(args.text_file)
except (RuntimeError, torch.OutOfMemoryError) as e:
if args.force_cpu_on_oom and "out of memory" in str(e).lower() and args.device != "cpu":
print(f"\n⚠️ {args.device.upper()} out of memory: {e}")
print("🔄 Automatically switching to CPU and retrying...")
# Create new generator with CPU
generator = AudiobookGenerator(
speaker_id=args.speaker,
output_base_name=args.output,
device="cpu",
exaggeration=args.exaggeration,
cfg_weight=args.cfg_weight,
temperature=args.temperature,
pause_between_sentences=args.sentence_pause,
pause_between_paragraphs=args.paragraph_pause,
keep_model_loaded=args.keep_model_loaded,
cleanup_interval=args.cleanup_interval,
use_subprocess=args.use_subprocess
)
await generator.generate_audiobook(args.text_file)
print("✅ Successfully completed using CPU fallback!")
else:
raise
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
return 1
return 0
if __name__ == "__main__":
sys.exit(asyncio.run(main()))
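A programmatic equivalent of the CLI above; the speaker id is illustrative (run python cbx-audiobook.py --list-speakers to find a real one), and AudiobookGenerator from this file is assumed to be in scope:

import asyncio

async def make_book() -> None:
    gen = AudiobookGenerator(
        speaker_id="90fcd672-ba84-441a-ac6c-0449a59653bd",
        output_base_name="sample_book",
        device="cpu",
        keep_model_loaded=True,  # faster across many chunks, higher memory use
    )
    await gen.generate_audiobook("sample_book.txt")

asyncio.run(make_book())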

View File

@ -6,6 +6,9 @@ import yaml
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Import helper to fix Python path
import import_helper
def split_text_at_sentence_boundaries(text, max_length=300):
"""
Split text at sentence boundaries, ensuring each chunk is <= max_length.

cbx-generate.py
View File

@ -1,22 +1,77 @@
import argparse
import gc
import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
from contextlib import contextmanager
# Import helper to fix Python path
import import_helper
def safe_load_chatterbox_tts(device):
"""
Safely load ChatterboxTTS model with device mapping to handle CUDA->MPS/CPU conversion.
This patches torch.load temporarily to map CUDA tensors to the appropriate device.
"""
@contextmanager
def patch_torch_load(target_device):
original_load = torch.load
def patched_load(*args, **kwargs):
# Add map_location to handle device mapping
if 'map_location' not in kwargs:
if target_device == "mps" and torch.backends.mps.is_available():
kwargs['map_location'] = torch.device('mps')
else:
kwargs['map_location'] = torch.device('cpu')
return original_load(*args, **kwargs)
torch.load = patched_load
try:
yield
finally:
torch.load = original_load
with patch_torch_load(device):
return ChatterboxTTS.from_pretrained(device=device)
def main():
parser = argparse.ArgumentParser(description="Chatterbox TTS audio generation")
parser.add_argument('--sample', required=True, type=str, help='Prompt/reference audio file (e.g. .wav, .mp3) for the voice')
parser.add_argument('--output', required=True, type=str, help='Output audio file path (should end with .wav)')
parser.add_argument('--text', required=True, type=str, help='Text to synthesize')
parser.add_argument('--device', default="mps", choices=["mps", "cuda", "cpu"], help='Device to use for TTS (default: mps)')
args = parser.parse_args()
-# Load model on MPS (for Apple Silicon)
-model = ChatterboxTTS.from_pretrained(device="mps")
model = None
wav = None
-# Generate the audio
-wav = model.generate(args.text, audio_prompt_path=args.sample)
-# Save to output .wav
-ta.save(args.output, wav, model.sr)
-print(f"Generated audio saved to {args.output}")
try:
# Load model with safe device mapping
model = safe_load_chatterbox_tts(args.device)
# Generate the audio
with torch.no_grad():
wav = model.generate(args.text, audio_prompt_path=args.sample)
# Save to output .wav
ta.save(args.output, wav, model.sr)
print(f"Generated audio saved to {args.output}")
finally:
# Explicit cleanup
if wav is not None:
del wav
if model is not None:
del model
# Force cleanup
gc.collect()
if args.device == "cuda" and torch.cuda.is_available():
torch.cuda.empty_cache()
elif args.device == "mps" and torch.backends.mps.is_available():
if hasattr(torch.mps, "empty_cache"):
torch.mps.empty_cache()
if __name__ == '__main__':
main()

2
forge.yaml Normal file
View File

@ -0,0 +1,2 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/antinomyhq/forge/refs/heads/main/forge.schema.json
model: qwen/qwen3-coder

10
frontend/.env.example Normal file
View File

@ -0,0 +1,10 @@
# Frontend Configuration
# Copy this file to .env and adjust values as needed
# Backend API configuration
VITE_API_BASE_URL=http://localhost:8000
VITE_API_BASE_URL_WITH_PREFIX=http://localhost:8000/api
# Development server configuration
VITE_DEV_SERVER_PORT=8001
VITE_DEV_SERVER_HOST=127.0.0.1

763
frontend/css/style.css Normal file
View File

@ -0,0 +1,763 @@
/* CSS Custom Properties - Color Palette */
:root {
/* Primary Colors */
--primary-blue: #163b65;
--primary-blue-dark: #357ab8;
--primary-blue-darker: #205081;
/* Background Colors */
--bg-body: #f7f9fa;
--bg-white: #fff;
--bg-light: #f3f5f7;
--bg-lighter: #f9fafb;
--bg-blue-light: #f3f7fa;
--bg-blue-lighter: #eaf1fa;
--bg-gray-light: #f7f7f7;
/* Text Colors */
--text-primary: #222;
--text-secondary: #2b2b2b;
--text-tertiary: #333;
--text-white: #fff;
--text-blue: #153f6f;
--text-blue-dark: #357ab8;
--text-blue-darker: #205081;
/* Border Colors */
--border-light: #e5e7eb;
--border-medium: #cfd8dc;
--border-blue: #b5c6df;
--border-gray: #e3e3e3;
/* Status Colors */
--error-bg: #e74c3c;
--error-bg-dark: #c0392b;
--warning-bg: #f9e79f;
--warning-text: #b7950b;
--warning-border: #f7ca18;
/* Header/Footer */
--header-bg: #222e3a;
/* Shadows */
--shadow-light: rgba(44,62,80,0.04);
--shadow-medium: rgba(44,62,80,0.06);
--shadow-strong: rgba(44,62,80,0.07);
}
body {
font-family: 'Segoe UI', 'Roboto', 'Arial', sans-serif;
line-height: 1.7;
margin: 0;
padding: 0;
background-color: var(--bg-body);
color: var(--text-primary);
}
.container {
max-width: 1280px;
margin: 0 auto;
padding: 0 18px;
}
header {
background: var(--header-bg);
color: var(--text-white);
padding: 1.5rem 0 1rem 0;
text-align: center;
border-bottom: 3px solid var(--primary-blue);
}
h1 {
font-size: 2.4rem;
margin: 0;
letter-spacing: 1px;
}
main {
margin-top: 30px;
margin-bottom: 30px;
}
.panel-grid {
display: flex;
flex-wrap: wrap;
gap: 28px;
justify-content: space-between;
}
.panel {
flex: 1 1 320px;
min-width: 320px;
background: none;
box-shadow: none;
border: none;
padding: 0;
}
#results-display.panel {
flex: 1 1 100%;
min-width: 0;
margin-top: 32px;
}
/* Dialog Table Styles */
#dialog-items-table {
width: 100%;
border-collapse: collapse;
background: var(--bg-white);
border-radius: 8px;
overflow: hidden;
font-size: 1rem;
margin-bottom: 0;
table-layout: fixed;
}
#dialog-items-table th, #dialog-items-table td {
padding: 7px 10px;
border: 1px solid var(--border-light);
text-align: left;
vertical-align: middle;
font-weight: 300;
font-size: 0.97rem;
color: var(--text-secondary);
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
/* Widen the Text/Duration column */
#dialog-items-table th:nth-child(3), #dialog-items-table td:nth-child(3) {
min-width: 320px;
width: 50%;
font-weight: 300;
font-size: 1rem;
}
/* Allow wrapping for Text/Duration (3rd) column */
#dialog-items-table td:nth-child(3),
#dialog-items-table td.dialog-editable-cell {
white-space: pre-wrap; /* wrap text and preserve newlines */
overflow: visible; /* override global overflow hidden */
text-overflow: clip; /* no ellipsis */
word-break: break-word;/* wrap long words/URLs */
color: var(--text-primary); /* darker text for readability */
font-weight: 350; /* slightly heavier than 300, lighter than 400 */
}
/* Make the Speaker (2nd) column narrower */
#dialog-items-table th:nth-child(2), #dialog-items-table td:nth-child(2) {
width: 60px;
min-width: 60px;
max-width: 60px;
text-align: center;
}
/* Actions (4th) column sizing */
#dialog-items-table th:nth-child(4), #dialog-items-table td:nth-child(4) {
width: 200px;
min-width: 180px;
max-width: 280px;
text-align: left;
padding-left: 0;
padding-right: 0;
}
#dialog-items-table th:first-child, #dialog-items-table td.type-icon-cell {
width: 44px;
min-width: 36px;
max-width: 48px;
text-align: center;
padding-left: 0;
padding-right: 0;
}
.type-icon-cell {
text-align: center;
vertical-align: middle;
}
.dialog-type-icon {
font-size: 1.4em;
display: inline-block;
line-height: 1;
vertical-align: middle;
}
#dialog-items-table th {
background: var(--bg-light);
color: var(--primary-blue);
font-weight: 600;
font-size: 1.05rem;
}
#dialog-items-table tr:last-child td {
border-bottom: none;
}
#dialog-items-table td.actions {
text-align: left;
min-width: 200px;
white-space: normal; /* allow wrapping so we don't see ellipsis */
overflow: visible; /* override table cell default from global rule */
text-overflow: clip; /* no ellipsis */
}
/* Allow wrapping of action buttons on smaller screens */
@media (max-width: 900px) {
#dialog-items-table th:nth-child(4), #dialog-items-table td:nth-child(4) {
width: auto;
min-width: 160px;
max-width: none;
}
#dialog-items-table td.actions {
white-space: normal;
}
}
/* Collapsible log details */
details#generation-log-details {
margin-bottom: 0;
border-radius: 4px;
background: var(--bg-light);
box-shadow: 0 1px 3px var(--shadow-light);
padding: 0 0 0 0;
transition: box-shadow 0.15s;
}
details#generation-log-details[open] {
box-shadow: 0 2px 8px var(--shadow-strong);
background: var(--bg-lighter);
}
details#generation-log-details summary {
font-size: 1rem;
color: var(--text-blue);
padding: 10px 0 6px 0;
outline: none;
}
details#generation-log-details summary:focus {
outline: 2px solid var(--primary-blue);
border-radius: 3px;
}
@media (max-width: 900px) {
.panel-grid {
display: block;
gap: 0;
}
.panel, .full-width-panel {
min-width: 0;
width: 100%;
flex: 1 1 100%;
}
#dialog-items-table th, #dialog-items-table td {
font-size: 0.97rem;
padding: 7px 8px;
}
#speaker-management.panel {
margin-bottom: 36px;
width: 100%;
max-width: 100%;
flex: 1 1 100%;
}
}
.card {
background: var(--bg-white);
border-radius: 8px;
box-shadow: 0 2px 8px var(--shadow-medium);
padding: 18px 20px;
margin-bottom: 18px;
}
section {
margin-bottom: 0;
border-radius: 0;
padding: 0;
background: none;
}
hr {
display: none;
}
h2 {
font-size: 1.5rem;
margin-top: 0;
margin-bottom: 16px;
color: var(--primary-blue);
letter-spacing: 0.5px;
}
h3 {
font-size: 1.1rem;
margin-bottom: 10px;
color: var(--text-tertiary);
}
.x-remove-btn {
background: var(--error-bg);
color: var(--text-white);
border: none;
border-radius: 50%;
width: 32px;
height: 32px;
font-size: 1.25rem;
line-height: 1;
display: inline-flex;
align-items: center;
justify-content: center;
cursor: pointer;
transition: background 0.15s;
margin: 0 3px;
box-shadow: 0 1px 2px var(--shadow-light);
outline: none;
padding: 0;
vertical-align: middle;
}
.x-remove-btn:hover, .x-remove-btn:focus {
background: var(--error-bg-dark);
color: var(--text-white);
outline: 2px solid var(--error-bg);
}
.form-row {
display: flex;
align-items: center;
gap: 12px;
margin-bottom: 14px;
}
label {
min-width: 120px;
font-weight: 500;
margin-bottom: 0;
}
input[type='text'], input[type='file'], textarea {
padding: 8px 10px;
border: 1px solid var(--border-medium);
border-radius: 4px;
font-size: 1rem;
width: 100%;
box-sizing: border-box;
}
input[type='file'] {
background: var(--bg-gray-light);
font-size: 0.97rem;
}
.dialog-edit-textarea {
min-height: 60px;
resize: vertical;
font-family: inherit;
line-height: 1.4;
}
button {
padding: 9px 18px;
background: var(--primary-blue);
color: var(--text-white);
border: none;
border-radius: 5px;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
transition: background 0.15s;
margin-right: 10px;
}
.generate-line-btn, .play-line-btn, .stop-line-btn {
background: var(--bg-blue-light);
color: var(--text-blue);
border: 1.5px solid var(--border-blue);
border-radius: 50%;
width: 32px;
height: 32px;
font-size: 1.25rem;
display: inline-flex;
align-items: center;
justify-content: center;
margin: 0 3px;
padding: 0;
box-shadow: 0 1px 2px var(--shadow-light);
vertical-align: middle;
}
.generate-line-btn:disabled, .play-line-btn:disabled, .stop-line-btn:disabled {
opacity: 0.45;
cursor: not-allowed;
}
.generate-line-btn.loading {
background: var(--warning-bg);
color: var(--warning-text);
border-color: var(--warning-border);
}
.generate-line-btn:hover, .play-line-btn:hover, .stop-line-btn:hover {
background: var(--bg-blue-lighter);
color: var(--text-blue-darker);
border-color: var(--text-blue);
}
button:hover, button:focus {
background: var(--primary-blue-dark);
outline: none;
}
.dialog-controls {
margin-bottom: 10px;
}
#speaker-list {
list-style: none;
padding: 0;
margin: 0;
}
#speaker-list li {
padding: 7px 0;
border-bottom: 1px solid var(--border-gray);
display: flex;
justify-content: space-between;
align-items: center;
}
#speaker-list li:last-child {
border-bottom: none;
}
pre {
background: var(--bg-light);
padding: 12px;
border-radius: 4px;
font-size: 0.98rem;
white-space: pre-wrap;
word-wrap: break-word;
margin: 0;
}
audio {
width: 100%;
margin-top: 8px;
margin-bottom: 8px;
}
#zip-archive-link {
display: inline-block;
margin-right: 10px;
color: var(--text-white);
background: var(--primary-blue);
padding: 7px 16px;
border-radius: 4px;
text-decoration: none;
font-weight: 500;
transition: background 0.15s;
}
#zip-archive-link:hover, #zip-archive-link:focus {
background: var(--primary-blue-dark);
}
footer {
text-align: center;
padding: 20px 0;
background: var(--header-bg);
color: var(--text-white);
margin-top: 40px;
font-size: 1rem;
border-top: 3px solid var(--primary-blue);
}
/* Inline Notification */
.notice {
max-width: 1280px;
margin: 16px auto 0;
padding: 12px 16px;
border-radius: 6px;
border: 1px solid var(--border-medium);
background: var(--bg-white);
color: var(--text-primary);
display: flex;
align-items: center;
gap: 12px;
box-shadow: 0 1px 2px var(--shadow-light);
}
.notice--info {
border-color: var(--border-blue);
background: var(--bg-blue-light);
}
.notice--success {
border-color: #A7F3D0;
background: #ECFDF5;
}
.notice--warning {
border-color: var(--warning-border);
background: var(--warning-bg);
}
.notice--error {
border-color: var(--error-bg-dark);
background: #FEE2E2;
}
.notice__content {
flex: 1;
}
.notice__actions {
display: flex;
gap: 8px;
}
.notice__actions button {
padding: 6px 12px;
border-radius: 4px;
border: 1px solid var(--border-medium);
background: var(--bg-white);
cursor: pointer;
}
.notice__actions .btn-primary {
background: var(--primary-blue);
color: var(--text-white);
border: none;
}
.notice__close {
background: none;
border: none;
font-size: 18px;
cursor: pointer;
color: var(--text-secondary);
}
@media (max-width: 900px) {
.panel-grid {
flex-direction: column;
gap: 22px;
}
.panel {
min-width: 0;
}
}
/* Simple side-by-side layout for speaker management */
.speaker-mgmt-row {
display: flex;
gap: 20px;
}
.speaker-mgmt-row .card {
flex: 1;
width: 50%;
}
/* Stack on mobile */
@media (max-width: 768px) {
.speaker-mgmt-row {
flex-direction: column;
}
.speaker-mgmt-row .card {
width: 100%;
}
}
.move-up-btn, .move-down-btn {
background: var(--bg-blue-light);
color: var(--text-blue);
border: 1.5px solid var(--border-blue);
border-radius: 50%;
width: 32px;
height: 32px;
font-size: 1.25rem;
display: inline-flex;
align-items: center;
justify-content: center;
margin: 0 3px;
padding: 0;
box-shadow: 0 1px 2px var(--shadow-light);
vertical-align: middle;
cursor: pointer;
transition: background 0.15s, color 0.15s, border-color 0.15s;
}
.move-up-btn:disabled, .move-down-btn:disabled {
opacity: 0.45;
cursor: not-allowed;
}
.move-up-btn:hover:not(:disabled), .move-down-btn:hover:not(:disabled) {
background: var(--bg-blue-lighter);
color: var(--text-blue-darker);
border-color: var(--text-blue);
}
.move-up-btn:focus, .move-down-btn:focus {
outline: 2px solid var(--primary-blue);
outline-offset: 2px;
}
/* TTS Settings Modal */
.modal {
position: fixed;
top: 0;
left: 0;
width: 100%;
height: 100%;
background-color: rgba(0, 0, 0, 0.5);
z-index: 1000;
display: flex;
align-items: center;
justify-content: center;
}
.modal-content {
background: var(--bg-white);
border-radius: 8px;
box-shadow: 0 2px 8px var(--shadow-strong); /* --shadow-strong is a color, so it needs offsets/blur to be a valid shadow */
max-width: 500px;
width: 90%;
max-height: 80vh;
overflow-y: auto;
}
.modal-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 20px 24px 16px;
border-bottom: 1px solid var(--border-light);
}
.modal-header h3 {
margin: 0;
color: var(--text-primary);
font-size: 1.25rem;
}
.modal-close {
background: none;
border: none;
font-size: 24px;
cursor: pointer;
color: var(--text-secondary);
padding: 4px;
border-radius: 4px;
transition: background-color 0.2s;
}
.modal-close:hover {
background-color: var(--bg-light);
}
.modal-body {
padding: 20px 24px;
}
.settings-group {
margin-bottom: 20px;
}
.settings-group label {
display: block;
margin-bottom: 8px;
font-weight: 500;
color: var(--text-primary);
}
.settings-group input[type="range"] {
width: 100%;
margin-bottom: 4px;
}
.settings-group span {
display: inline-block;
min-width: 40px;
font-weight: 500;
color: var(--primary-blue);
margin-left: 8px;
}
.settings-group small {
display: block;
color: var(--text-secondary);
font-size: 0.875rem;
margin-top: 4px;
line-height: 1.3;
}
.modal-footer {
display: flex;
gap: 12px;
justify-content: flex-end;
padding: 16px 24px 20px;
border-top: 1px solid var(--border-light);
}
.btn-primary, .btn-secondary {
padding: 8px 16px;
border-radius: 4px;
border: none;
cursor: pointer;
font-size: 0.875rem;
font-weight: 500;
transition: all 0.2s;
}
.btn-primary {
background-color: var(--primary-blue);
color: var(--text-white);
}
.btn-primary:hover {
background-color: var(--primary-blue-dark);
}
.btn-secondary {
background-color: var(--bg-light);
color: var(--text-secondary);
border: 1px solid var(--border-medium);
}
.btn-secondary:hover {
background-color: var(--border-light);
}
/* Settings button styling */
.settings-line-btn {
width: 32px;
height: 32px;
border-radius: 50%;
border: none;
background-color: var(--bg-light);
color: var(--text-secondary);
cursor: pointer;
font-size: 14px;
margin: 0 2px;
transition: all 0.2s;
display: inline-flex;
align-items: center;
justify-content: center;
vertical-align: middle;
}
.settings-line-btn:hover {
background-color: var(--primary-blue);
color: var(--text-white);
transform: scale(1.05);
}
.settings-line-btn:disabled {
opacity: 0.5;
cursor: not-allowed;
transform: none;
}

172
frontend/index.html Normal file
View File

@ -0,0 +1,172 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Chatterbox TTS Frontend</title>
<link rel="stylesheet" href="css/style.css">
</head>
<body>
<header>
<div class="container">
<h1>Chatterbox TTS</h1>
</div>
<!-- Paste Script Modal -->
<div id="paste-script-modal" class="modal" style="display: none;">
<div class="modal-content">
<div class="modal-header">
<h3>Paste Dialog Script</h3>
<button class="modal-close" id="paste-script-close">&times;</button>
</div>
<div class="modal-body">
<p>Paste JSONL content (one JSON object per line). Example lines:</p>
<pre style="white-space:pre-wrap; background:#f6f8fa; padding:8px; border-radius:4px;">
{"type":"speech","speaker_id":"alice","text":"Hello there!"}
{"type":"silence","duration":0.5}
{"type":"speech","speaker_id":"bob","text":"Hi!"}
</pre>
<textarea id="paste-script-text" rows="10" style="width:100%;" placeholder='Paste JSONL here'></textarea>
</div>
<div class="modal-footer">
<button id="paste-script-load" class="btn-primary">Load</button>
<button id="paste-script-cancel" class="btn-secondary">Cancel</button>
</div>
</div>
</div>
</header>
<!-- Global inline notification area -->
<div id="global-notice" class="notice" role="status" aria-live="polite" style="display:none;">
<div class="notice__content" id="global-notice-content"></div>
<div class="notice__actions" id="global-notice-actions"></div>
<button class="notice__close" id="global-notice-close" aria-label="Close notification">&times;</button>
</div>
<main class="container" role="main">
<div class="panel-grid">
<section id="dialog-editor" class="panel full-width-panel" aria-labelledby="dialog-editor-title">
<h2 id="dialog-editor-title">Dialog Editor</h2>
<div class="card">
<table id="dialog-items-table">
<thead>
<tr>
<th>Type</th>
<th>Speaker</th>
<th>Text / Duration</th>
<th>Actions</th>
</tr>
</thead>
<tbody id="dialog-items-container">
<!-- Dialog items will be rendered here by JavaScript as <tr> -->
</tbody>
</table>
</div>
<div id="temp-input-area" class="card">
<!-- Temporary inputs for speech/silence will go here -->
</div>
<div class="dialog-controls form-row">
<button id="add-speech-line-btn">Add Speech Line</button>
<button id="add-silence-line-btn">Add Silence Line</button>
<button id="generate-dialog-btn">Generate Dialog</button>
</div>
<div class="dialog-controls form-row">
<label for="output-base-name">Output Base Name:</label>
<input type="text" id="output-base-name" name="output-base-name" value="dialog_output" required>
</div>
<div class="dialog-controls form-row">
<button id="save-script-btn">Save Script</button>
<input type="file" id="load-script-input" accept=".jsonl" style="display: none;">
<button id="load-script-btn">Load Script</button>
<button id="paste-script-btn">Paste Script</button>
</div>
</section>
</div>
<!-- Results below -->
<section id="results-display" class="panel" aria-labelledby="results-display-title">
<h2 id="results-display-title">Results</h2>
<div class="card">
<details id="generation-log-details">
<summary style="cursor:pointer;font-weight:500;">Show Generation Log</summary>
<pre id="generation-log-content" style="margin-top:12px;">(Generation log will appear here)</pre>
</details>
</div>
<div class="card">
<h3>Concatenated Audio:</h3>
<audio id="concatenated-audio-player" controls src=""></audio>
</div>
<div class="card">
<h3>Download Archive:</h3>
<a id="zip-archive-link" href="#" download style="display: none;">Download ZIP</a>
<p id="zip-archive-placeholder">(ZIP download link will appear here)</p>
</div>
</section>
<!-- Speaker management row below Results, side by side -->
<div class="speaker-mgmt-row">
<div id="speaker-list-container" class="card">
<h3>Available Speakers</h3>
<ul id="speaker-list">
<!-- Speakers will be populated here by JavaScript -->
</ul>
</div>
<div id="add-speaker-container" class="card">
<h3>Add New Speaker</h3>
<form id="add-speaker-form">
<div class="form-row">
<label for="speaker-name">Speaker Name:</label>
<input type="text" id="speaker-name" name="name" required>
</div>
<div class="form-row">
<label for="speaker-sample">Audio Sample (WAV or MP3):</label>
<input type="file" id="speaker-sample" name="audio_file" accept=".wav,.mp3" required>
</div>
<button type="submit">Add Speaker</button>
</form>
</div>
</div>
</main>
<footer>
<div class="container">
<p>&copy; 2024 Chatterbox TTS</p>
</div>
</footer>
<!-- TTS Settings Modal -->
<div id="tts-settings-modal" class="modal" style="display: none;">
<div class="modal-content">
<div class="modal-header">
<h3>TTS Settings</h3>
<button class="modal-close" id="tts-modal-close">&times;</button>
</div>
<div class="modal-body">
<div class="settings-group">
<label for="tts-exaggeration">Exaggeration:</label>
<input type="range" id="tts-exaggeration" min="0" max="2" step="0.1" value="0.5">
<span id="tts-exaggeration-value">0.5</span>
<small>Controls expressiveness. Higher values = more exaggerated speech.</small>
</div>
<div class="settings-group">
<label for="tts-cfg-weight">CFG Weight:</label>
<input type="range" id="tts-cfg-weight" min="0" max="2" step="0.1" value="0.5">
<span id="tts-cfg-weight-value">0.5</span>
<small>Alignment with prompt. Higher values = more aligned with speaker characteristics.</small>
</div>
<div class="settings-group">
<label for="tts-temperature">Temperature:</label>
<input type="range" id="tts-temperature" min="0" max="2" step="0.1" value="0.8">
<span id="tts-temperature-value">0.8</span>
<small>Randomness. Lower values = more deterministic, higher = more varied.</small>
</div>
</div>
<div class="modal-footer">
<button id="tts-settings-save" class="btn-primary">Save Settings</button>
<button id="tts-settings-cancel" class="btn-secondary">Cancel</button>
</div>
</div>
</div>
<script src="js/api.js" type="module"></script>
<script src="js/app.js" type="module" defer></script>
</body>
</html>

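The value spans beside each slider (tts-exaggeration-value, etc.) are kept in sync by frontend/js/app.js, whose diff is suppressed below. A minimal sketch of that wiring — an assumption about app.js's behavior, not its actual code:
// sketch (assumed app.js behavior): mirror each range input into its display span
for (const id of ['tts-exaggeration', 'tts-cfg-weight', 'tts-temperature']) {
  const input = document.getElementById(id);
  const label = document.getElementById(`${id}-value`);
  input.addEventListener('input', () => { label.textContent = input.value; });
}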
159
frontend/js/api.js Normal file
View File

@ -0,0 +1,159 @@
// frontend/js/api.js
import { API_BASE_URL_WITH_PREFIX } from './config.js';
const API_BASE_URL = API_BASE_URL_WITH_PREFIX;
/**
* Fetches the list of available speakers.
* @returns {Promise<Array<Object>>} A promise that resolves to an array of speaker objects.
* @throws {Error} If the network response is not ok.
*/
export async function getSpeakers() {
const response = await fetch(`${API_BASE_URL}/speakers`);
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to fetch speakers: ${errorData.detail || errorData.message || response.statusText}`);
}
return response.json();
}
// We will add more functions here: addSpeaker, deleteSpeaker, generateDialog
// ... (keep API_BASE_URL and getSpeakers)
/**
* Adds a new speaker.
* @param {FormData} formData - The form data containing speaker name and audio file.
* Example: formData.append('name', 'New Speaker');
* formData.append('audio_file', fileInput.files[0]);
* @returns {Promise<Object>} A promise that resolves to the new speaker object.
* @throws {Error} If the network response is not ok.
*/
export async function addSpeaker(formData) {
const response = await fetch(`${API_BASE_URL}/speakers`, {
method: 'POST',
body: formData, // FormData sets Content-Type to multipart/form-data automatically
});
if (!response.ok) {
console.log('API_JS_ADD_SPEAKER: Entered !response.ok block. Status:', response.status, 'StatusText:', response.statusText);
let errorPayload = { detail: `Request failed with status ${response.status}` }; // Default payload
try {
console.log('API_JS_ADD_SPEAKER: Attempting to parse error response as JSON...');
errorPayload = await response.json();
console.log('API_JS_ADD_SPEAKER: Successfully parsed error JSON:', errorPayload);
} catch (e) {
console.warn('API_JS_ADD_SPEAKER: Failed to parse error response as JSON. Error:', e);
// Use statusText if JSON parsing fails
errorPayload = { detail: response.statusText || `Request failed with status ${response.status} and no JSON body.`, parseError: e.toString() };
}
console.error('--- BEGIN SERVER ERROR PAYLOAD (addSpeaker) ---');
console.error('Status:', response.status);
console.error('Status Text:', response.statusText);
console.error('Parsed Payload:', errorPayload);
console.error('--- END SERVER ERROR PAYLOAD (addSpeaker) ---');
let detailedMessage = "Unknown error";
if (errorPayload && errorPayload.detail) {
if (typeof errorPayload.detail === 'string') {
detailedMessage = errorPayload.detail;
} else {
// If detail is an array (FastAPI validation errors) or object, stringify it.
detailedMessage = JSON.stringify(errorPayload.detail);
}
} else if (errorPayload && errorPayload.message) {
detailedMessage = errorPayload.message;
} else if (response.statusText) {
detailedMessage = response.statusText;
} else {
detailedMessage = `HTTP error ${response.status}`;
}
console.log(`API_JS_ADD_SPEAKER: Constructed detailedMessage: "${detailedMessage}"`);
console.log(`API_JS_ADD_SPEAKER: Throwing error with message: "Failed to add speaker: ${detailedMessage}"`);
throw new Error(`Failed to add speaker: ${detailedMessage}`);
}
return response.json();
}
// ... (keep API_BASE_URL, getSpeakers, addSpeaker)
/**
* Deletes a speaker by their ID.
* @param {string} speakerId - The ID of the speaker to delete.
* @returns {Promise<Object>} A promise that resolves to the response data (e.g., success message).
* @throws {Error} If the network response is not ok.
*/
export async function deleteSpeaker(speakerId) {
const response = await fetch(`${API_BASE_URL}/speakers/${speakerId}`, {
method: 'DELETE',
});
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to delete speaker ${speakerId}: ${errorData.detail || errorData.message || response.statusText}`);
}
// Handle 204 No Content specifically, as .json() would fail
if (response.status === 204) {
return { message: `Speaker ${speakerId} deleted successfully.` };
}
return response.json();
}
// ... (keep API_BASE_URL, getSpeakers, addSpeaker, deleteSpeaker)
/**
* Generates audio for a single dialog line (speech or silence).
* @param {Object} line - The dialog line object (type: 'speech' or 'silence').
* @returns {Promise<Object>} Resolves with { audio_url } on success.
* @throws {Error} If the network response is not ok.
*/
export async function generateLine(line) {
console.log('generateLine called with:', line);
const response = await fetch(`${API_BASE_URL}/dialog/generate_line`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(line),
});
console.log('Response status:', response.status);
console.log('Response headers:', [...response.headers.entries()]);
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to generate line audio: ${errorData.detail || errorData.message || response.statusText}`);
}
const data = await response.json();
return data;
}
/**
* Generates a dialog by sending a payload to the backend.
* @param {Object} dialogPayload - The payload for dialog generation.
* Example:
* {
* output_base_name: "my_dialog",
* dialog_items: [
* { type: "speech", speaker_id: "speaker1", text: "Hello world.", exaggeration: 1.0, cfg_weight: 2.0, temperature: 0.7 },
* { type: "silence", duration: 0.5 },
* { type: "speech", speaker_id: "speaker2", text: "How are you?" }
* ]
* }
* @returns {Promise<Object>} A promise that resolves to the dialog generation response (log, file URLs).
* @throws {Error} If the network response is not ok.
*/
export async function generateDialog(dialogPayload) {
const response = await fetch(`${API_BASE_URL}/dialog/generate`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(dialogPayload),
});
if (!response.ok) {
const errorData = await response.json().catch(() => ({ message: response.statusText }));
throw new Error(`Failed to generate dialog: ${errorData.detail || errorData.message || response.statusText}`);
}
return response.json();
}

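For orientation, a minimal usage sketch of this module (hypothetical caller code; the real consumer is frontend/js/app.js, whose diff is suppressed below). The payload shape follows the JSDoc example above:
// sketch: calling the api.js client (hypothetical; not part of this diff)
import { getSpeakers, generateDialog } from './api.js';

async function demo() {
  const speakers = await getSpeakers();
  const result = await generateDialog({
    output_base_name: 'demo',
    dialog_items: [
      { type: 'speech', speaker_id: speakers[0]?.id, text: 'Hello.' },
      { type: 'silence', duration: 0.5 },
    ],
  });
  console.log(result.log);
}

demo().catch((err) => console.error(err.message));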
1188
frontend/js/app.js Normal file

File diff suppressed because it is too large Load Diff

42
frontend/js/config.js Normal file
View File

@ -0,0 +1,42 @@
// Frontend Configuration
// This file handles environment variable configuration for the frontend
// Get environment variables (these would be injected by a build tool like Vite)
// For now, we'll use defaults that can be overridden
const getEnvVar = (name, defaultValue) => {
// In a real Vite setup, this would be import.meta.env[name]
// For now, we'll check if there's a global config object or use defaults
if (typeof window !== 'undefined' && window.APP_CONFIG && window.APP_CONFIG[name]) {
return window.APP_CONFIG[name];
}
return defaultValue;
};
// API Configuration
// Default to the same hostname as the frontend, on port 8000 (override via VITE_API_BASE_URL*)
const _defaultHost = (typeof window !== 'undefined' && window.location?.hostname) || 'localhost';
const _defaultPort = getEnvVar('VITE_API_BASE_URL_PORT', '8000');
const _defaultBase = `http://${_defaultHost}:${_defaultPort}`;
export const API_BASE_URL = getEnvVar('VITE_API_BASE_URL', _defaultBase);
export const API_BASE_URL_WITH_PREFIX = getEnvVar(
'VITE_API_BASE_URL_WITH_PREFIX',
`${_defaultBase}/api`
);
// For file serving (same as API_BASE_URL since files are served from the same server)
export const API_BASE_URL_FOR_FILES = API_BASE_URL;
// Development server configuration
export const DEV_SERVER_PORT = getEnvVar('VITE_DEV_SERVER_PORT', '8001');
export const DEV_SERVER_HOST = getEnvVar('VITE_DEV_SERVER_HOST', '127.0.0.1');
// Export all config as a single object for convenience
export const CONFIG = {
API_BASE_URL,
API_BASE_URL_WITH_PREFIX,
API_BASE_URL_FOR_FILES,
DEV_SERVER_PORT,
DEV_SERVER_HOST
};
export default CONFIG;

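Because getEnvVar() checks window.APP_CONFIG before falling back to defaults, a deployment can override the config with a plain script loaded before these modules — a minimal sketch (hypothetical host and port):
// sketch: pre-module override consumed by getEnvVar() in config.js (hypothetical values)
window.APP_CONFIG = {
  VITE_API_BASE_URL: 'http://192.168.1.50:8000',
  VITE_API_BASE_URL_WITH_PREFIX: 'http://192.168.1.50:8000/api',
};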
46
frontend/start_dev_server.py Normal file
View File

@ -0,0 +1,46 @@
#!/usr/bin/env python3
"""
Simple development server for the frontend that reads configuration from .env
"""
import os
import http.server
import socketserver
from pathlib import Path
# Try to load environment variables, but don't fail if dotenv is not available
try:
from dotenv import load_dotenv
load_dotenv()
except ImportError:
print("python-dotenv not installed, using system environment variables only")
# Configuration
PORT = int(os.getenv('VITE_DEV_SERVER_PORT', '8001'))
HOST = os.getenv('VITE_DEV_SERVER_HOST', '127.0.0.1')
# Change to frontend directory
frontend_dir = Path(__file__).parent
os.chdir(frontend_dir)
class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
def end_headers(self):
# Add CORS headers for development
self.send_header('Access-Control-Allow-Origin', '*')
self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
self.send_header('Access-Control-Allow-Headers', '*')
super().end_headers()
if __name__ == "__main__":
print(f"Starting Frontend Development Server...")
print(f"Host: {HOST}")
print(f"Port: {PORT}")
print(f"Serving from: {frontend_dir}")
print(f"Open: http://{HOST}:{PORT}")
with socketserver.TCPServer((HOST, PORT), MyHTTPRequestHandler) as httpd:
print(f"Server running at http://{HOST}:{PORT}/")
try:
httpd.serve_forever()
except KeyboardInterrupt:
print("\nShutting down server...")

196
frontend/tests/api.test.js Normal file
View File

@ -0,0 +1,196 @@
// frontend/tests/api.test.js
// Import the function to test (adjust path if your structure is different)
// We might need to configure Jest or use Babel for ES module syntax if this causes issues.
import { getSpeakers, addSpeaker, deleteSpeaker, generateDialog } from '../js/api.js';
// Mock the global fetch function
global.fetch = jest.fn();
const API_BASE_URL = 'http://localhost:8000/api'; // Centralize for all tests
describe('API Client - getSpeakers', () => {
beforeEach(() => {
// Clear all instances and calls to constructor and all methods:
fetch.mockClear();
});
it('should fetch speakers successfully', async () => {
const mockSpeakers = [{ id: '1', name: 'Speaker 1' }, { id: '2', name: 'Speaker 2' }];
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockSpeakers,
});
const speakers = await getSpeakers();
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers`);
expect(speakers).toEqual(mockSpeakers);
});
it('should throw an error if the network response is not ok', async () => {
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Not Found',
json: async () => ({ detail: 'Speakers not found' }) // Simulate FastAPI error response
});
await expect(getSpeakers()).rejects.toThrow('Failed to fetch speakers: Speakers not found');
expect(fetch).toHaveBeenCalledTimes(1);
});
it('should throw a generic error if parsing error response fails', async () => {
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Internal Server Error',
json: async () => { throw new Error('Failed to parse error JSON'); } // Simulate error during .json()
});
await expect(getSpeakers()).rejects.toThrow('Failed to fetch speakers: Internal Server Error');
expect(fetch).toHaveBeenCalledTimes(1);
});
it('should throw an error if fetch itself fails (network error)', async () => {
fetch.mockRejectedValueOnce(new TypeError('Network failed'));
await expect(getSpeakers()).rejects.toThrow('Network failed'); // This will be the original fetch error
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - addSpeaker', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should add a speaker successfully', async () => {
const mockFormData = new FormData(); // In a real scenario, this would have data
mockFormData.append('name', 'Test Speaker');
// mockFormData.append('audio_sample_file', new File([''], 'sample.wav')); // File creation in Node test needs more setup or a mock
const mockResponse = { id: '3', name: 'Test Speaker', message: 'Speaker added successfully' };
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockResponse,
});
const result = await addSpeaker(mockFormData);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers`, {
method: 'POST',
body: mockFormData,
});
expect(result).toEqual(mockResponse);
});
it('should throw an error if adding a speaker fails', async () => {
const mockFormData = new FormData();
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Bad Request',
json: async () => ({ detail: 'Invalid speaker data' }),
});
await expect(addSpeaker(mockFormData)).rejects.toThrow('Failed to add speaker: Invalid speaker data');
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - deleteSpeaker', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should delete a speaker successfully with JSON response', async () => {
const speakerId = 'test-speaker-id-123';
const mockResponse = { message: `Speaker ${speakerId} deleted successfully` };
fetch.mockResolvedValueOnce({
ok: true,
status: 200, // Or any 2xx status that might return JSON
json: async () => mockResponse,
});
const result = await deleteSpeaker(speakerId);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/${speakerId}`, {
method: 'DELETE',
});
expect(result).toEqual(mockResponse);
});
it('should handle successful deletion with 204 No Content response', async () => {
const speakerId = 'test-speaker-id-204';
fetch.mockResolvedValueOnce({
ok: true,
status: 204,
statusText: 'No Content',
// .json() is not called by the function if status is 204
});
const result = await deleteSpeaker(speakerId);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/speakers/${speakerId}`, {
method: 'DELETE',
});
expect(result).toEqual({ message: `Speaker ${speakerId} deleted successfully.` });
});
it('should throw an error if deleting a speaker fails (e.g., speaker not found)', async () => {
const speakerId = 'non-existent-speaker-id';
fetch.mockResolvedValueOnce({
ok: false,
status: 404,
statusText: 'Not Found',
json: async () => ({ detail: 'Speaker not found' }),
});
await expect(deleteSpeaker(speakerId)).rejects.toThrow(`Failed to delete speaker ${speakerId}: Speaker not found`);
expect(fetch).toHaveBeenCalledTimes(1);
});
});
describe('API Client - generateDialog', () => {
beforeEach(() => {
fetch.mockClear();
});
it('should generate dialog successfully', async () => {
const mockPayload = {
output_base_name: "test_dialog",
dialog_items: [
{ type: "speech", speaker_id: "spk_1", text: "Hello.", exaggeration: 1.0, cfg_weight: 3.0, temperature: 0.5 },
{ type: "silence", duration_ms: 250 }
]
};
const mockResponse = {
log: "Dialog generated.",
concatenated_audio_url: "/audio/test_dialog_concatenated.wav",
zip_archive_url: "/audio/test_dialog.zip"
};
fetch.mockResolvedValueOnce({
ok: true,
json: async () => mockResponse,
});
const result = await generateDialog(mockPayload);
expect(fetch).toHaveBeenCalledTimes(1);
expect(fetch).toHaveBeenCalledWith(`${API_BASE_URL}/dialog/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(mockPayload),
});
expect(result).toEqual(mockResponse);
});
it('should throw an error if dialog generation fails', async () => {
const mockPayload = { output_base_name: "fail_dialog", dialog_items: [] }; // Example invalid payload
fetch.mockResolvedValueOnce({
ok: false,
statusText: 'Bad Request',
json: async () => ({ detail: 'Invalid dialog data' }),
});
await expect(generateDialog(mockPayload)).rejects.toThrow('Failed to generate dialog: Invalid dialog data');
expect(fetch).toHaveBeenCalledTimes(1);
});
});

31
import_helper.py Normal file
View File

@ -0,0 +1,31 @@
"""
Import helper module for Chatterbox UI.
This module provides a function to add the project root to the Python path,
which helps resolve import issues when running scripts from different locations.
"""
import sys
import os
from pathlib import Path
def setup_python_path():
"""
Add the project root to the Python path.
This allows imports to work correctly regardless of where the script is run from.
"""
# Get the project root (the directory containing this file)
project_root = Path(__file__).resolve().parent
# Add the project root to the Python path if it's not already there
if str(project_root) not in sys.path:
sys.path.insert(0, str(project_root))
print(f"Added {project_root} to Python path")
# Set environment variable for other modules to use
os.environ["PROJECT_ROOT"] = str(project_root)
return project_root
# Run setup when this module is imported
project_root = setup_python_path()

9
jest.config.cjs Normal file
View File

@ -0,0 +1,9 @@
// jest.config.cjs
module.exports = {
testEnvironment: 'node',
transform: {
'^.+\\.js$': 'babel-jest',
},
moduleFileExtensions: ['js', 'json'],
roots: ['<rootDir>/frontend/tests', '<rootDir>'],
};

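The babel-jest transform above requires a Babel config; package.json (below) pins @babel/preset-env, so the matching babel.config.cjs would look roughly like this — a sketch, since the repo's actual Babel config is not shown in this compare view:
// babel.config.cjs (sketch, assuming @babel/preset-env from devDependencies)
module.exports = {
  presets: [['@babel/preset-env', { targets: { node: 'current' } }]],
};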
5384
package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

25
package.json Normal file
View File

@ -0,0 +1,25 @@
{
"name": "chatterbox-test",
"version": "1.0.0",
"description": "This Gradio application provides a user interface for text-to-speech generation using the Chatterbox TTS model. It supports both single utterance generation and multi-speaker dialog generation with configurable silence gaps.",
"main": "index.js",
"type": "module",
"scripts": {
"test": "jest",
"test:frontend": "jest --config ./jest.config.cjs",
"frontend:dev": "python3 frontend/start_dev_server.py"
},
"repository": {
"type": "git",
"url": "https://gitea.r8z.us/stwhite/chatterbox-ui.git"
},
"keywords": [],
"author": "",
"license": "ISC",
"devDependencies": {
"@babel/core": "^7.27.4",
"@babel/preset-env": "^7.27.2",
"babel-jest": "^29.7.0",
"jest": "^29.7.0"
}
}

129
predator.txt Normal file
View File

@ -0,0 +1,129 @@
He watched from the pickup, his feet dangling off of the end of the tailgate as he sipped a beer and swung his boots back and forth. He adjusted the Royals baseball cap and leaned back on his left hand, languid in the warm summer evening, the last bit of sun having disappeared just ten minutes ago, bringing surcease from the ridiculous August heat in Missouri. The high, thin clouds were that beautiful shade of salmon that made this the end of the “Golden Hour”. His black tank top was damp from sweat but his jeans were clean and he still smelled good.
The girl got out of a blue Prius that looked black under the flickering yellow pall of a high-pressure sodium light in the next row of the parking lot. She had glossy, dark hair that fell in waves to her shoulders that were bared by a ribbed white tank top, its hems decorated with lace, baring her gold belly-button-ring. She wore cutoff jeans shorts - they just missed the “daisy” appellation, but were short enough, with a frill of loose cotton threads neatly trimmed into a small white fringe at the bottom of each cut-off leg - and brown sandals with wedge heels and leather laces up her ankles to her calves. A tiny brown leather clutch swung from a strap across her body, and she tucked her car keys in it, snapped it closed, and let it fall to her side. She was very lightly tanned, much paler than most he'd seen around here. Her body was shapely; no bra, full breasts, narrow waist. Supple curves arced from her hips to her thighs and toned calves. Her fingers seemed both slender and strong, and she managed to look both muscular and soft. She started off towards the entrance to the fair at a brisk pace, her footfalls light, but she jiggled and bounced in interesting - and likely, intentional - ways.
He admired the way the muscles of her calves flexed to maintain her balance. She was breathtaking.
He hopped off of the tailgate, his boots crunching in the gravel, then closed it, silently, and began to follow her to the gate. He knew she'd turn to the right after going through the gate; he wasn't worried he would lose her. He didn't know how he knew - he could just tell. Hell, even if she didn't, it was a big county fair, but not so big he couldn't find one pretty little brunette. As he walked, hips loose, boots crunching in the gravel, fifty feet back, he wondered if she was meeting anyone. He was far enough away that she didn't even glance back. He knew from experience if he broke into a run she'd look back with the instinctive fear of a prey animal - a lovely, country-flavored gazelle or white tail on her cloven hooves of braided leather and cork. He didn't want her to know she was being pursued - not yet. That wasn't the game.
The fresh gray gravel was still hot from the sun, radiating heat into his feet and the air above it, along with the scent of dust and rock. It was mounded in the center, pressed down in the places where car tires rolled over it. Grasses pushed up to the edge of the gravel parking lot of the county fair. A slight breeze brought the distant smells of funnel cakes, hotdogs, and cotton candy. The lights of the fair were visible beyond the surrounding fence. He felt sweat in the small of his back and on his upper lip. Night was coming, though, and the temperature would continue to drop.
She reached the ticket booth - which was a folding card table flanked by bales of straw, manned by a fat, middle-aged, bleach-blonde woman and what he presumed was her fat, bored offspring. Mouth breathers, he observed. He could smell them from here, stale sweat, cigarette smoke, cheap cologne. He heard the girl's voice, a rich, lilting contralto that made him feel like salivating. “Just one, please.”
“Adult?” asked the bored woman, not even looking up, just staring at the roll of tickets, the money box, the electronic payment device. The girl laughed and it rang through him like a bell, inflaming a hunger he knew well. “Yes, please,” she replied, waving her phone at the point-of-sale payment device. It chimed and the woman handed her a ticket and a bright orange wristband, then waved her on. “Have fun!” she called after the girl, in a voice so empty of enthusiasm it seemed to suck happiness from the very air around it. She had a KC Chiefs tee shirt and black jeans stretched to their tensile limit. He assumed she had some boots on as well, a rural affectation common in this county where there was more forest than cattle.
When he arrived at the table, she regarded him with dead, watery faded blue eyes. “Adult?”
“Yep.” He didn't even bother to laugh.
“Ten bucks.” She picked up the ticket and wrist band and waited expectantly. He pulled a ten from his pocket and laid it on the table in front of her. She took it and handed it to the kid, who was wearing a tee-shirt emblazoned with the words “Let's go Brandon”. “Have fun,” she repeated, just as enthusiastically as before, tucking a damp lock of chemically altered hair behind her left ear. He grunted noncommittally and strolled after his gazelle, who'd gone out of sight - to his right, of course. Towards the paved area of the fair, where carnival rides and games blasted forth a cacophony of light and noise into the hot midwestern night, the smell of hot dogs, popcorn, and cotton candy vying for the attention of the press of fairgoers in their cowboy boots, jeans, and short skirts. Tee shirts here and there, and sometimes a pair of overalls, but it could have been a uniform. The people here were largely overweight, trending to dangerously obese, massive instances of humanity that lumbered, stomped, or waddled from game to game and food cart to food cart. He watched a dark-haired, overall-clad man who was at least six foot six and had to weigh 400 lbs on the hoof consume an enormous hot dog as though it were a light snack, in three quick bites, grease, mustard, and cheese running down his hand. He licked cheese from the back of his hand and wiped his hands on his capacious pants legs. He had a handgun in a high hip holster. Open carry was in evidence everywhere, peppered across demographics, from shapely young women with Glocks to octogenarians sporting well-worn 1911s and white flat-tops. It was Missouri, after all. He didn't need or want a gun, and it wouldn't do them any good if he turned his attention to them.
Cops were scattered through the fairground. Some were clearly private security, others might have been local police, sheriffs, or even highway patrol, for all he knew. There were at least four uniforms represented. Cops didn't concern him. He didn't look dangerous or threatening, and none looked at him directly, no scanning eyes paused on him or tracked his progress across the straw-strewn asphalt. It could get inconvenient if police became involved, of course, but he didn't worry much.
He'd gotten distracted, and was surprised as he nearly ran into his gazelle as she came around the end of a food cart, and he stopped suddenly to avoid bowling her over. She smiled and said “Excuse me!” and kept walking. “No problem!” he called after her, grinning at her back. It was good, he thought; interaction was key to breaking the ice later. Folks often walk the same direction through an exhibition like a fair, so being in the same general area over time wasn't unusual and she'd never know he was stalking her. They never figured it out, not before he wanted them to figure it out. He was an attractive, friendly looking man with an open, disarming smile, medium brown hair, a strong, muscular body, capable, competent, without being threatening. He was tall, but not surprisingly so - six feet nothing, maybe a hair more in his boots. A hundred and eighty pounds on most days, no belly but not sporting a sharp six pack either. Women found him attractive but not threatening, which was his intention. His eyes were blue and he had a well-trimmed mustache and the slightest hint of stubble. He watched her without looking at her, noting that she was alone, but kept checking her phone, occasionally texting someone. If her friends didn't show up it would make it easier for him to get her attention, to draw her in.
He floated near her, just exploring the fair in the same sequence, seemingly by chance. He paid $3 to play a game of chess against a fellow who was playing 12 people simultaneously. Overhead light from an LED lamp on a pole lit a rectangle of narrow tables, four chess boards on each. The man playing chess was dressed like one might imagine Sherlock Holmes, with a pipe clamped in his teeth. Sherlock walked clockwise around the rectangle making a move on each board as he came to it. He crushed the “chess master” in eighteen moves and moved on before the man could comment. He threw darts at balloons while watching her from the corner of his eye as she tried to ring a bell by swinging a hammer. He saw her check her phone again and look exasperated, her full lips pursing in frustration at something she read on the screen. She shrugged and looked around, almost catching him staring. Her eyes roamed the area and paused for the tiniest second on his profile, then swept along to take in the rest of the area. He strolled slowly to the next attraction, which was a booth where one could pay $5 to throw three hatchets at targets for prizes. There was a roof held up by four-by-fours spaced every five or six feet; each pair of four-by-fours made a lane for throwing axes, and there was a big target at the end of each lane, maybe twenty feet away. Five lanes, the ubiquitous straw strewn over the asphalt - to give that barnyard feel, he thought. He stepped up and handed the barker a twenty. The man was cajoling onlookers, almost chanting, about trying your luck and winning prizes throwing the axes, and his voice never faltered. He had a belt pouch that contained change, and was wearing worn jeans, worn athletic shoes, and a worn tee-shirt from a rock concert of a band long forgotten in this day and age. Belt Pouch put three axes in the basket next to the lane opening, put three fives on the small change shelf and stepped aside, making the twenty vanish into the pouch.
He picked up the first ax and measured its weight in his hand. He judged the distance and tossed the ax overhand in a smooth gesture. It struck head-first with a loud thump and fell to the ground, the head clanging against the asphalt. He picked up the next ax and tossed it without any theatrics and it stuck solid, outside the bullseye. He flipped the third after it almost nonchalantly and it stuck next to its sibling, this time at the edge of the red circle. Belt Pouch paused for an instant, and retrieved the thrown axes, offering them to him, and he accepted with a nod. The carnie's patter changed, saying something about watching an expert at work. He tossed all three, rapidly, one after the other, and they lined up on the bullseye, separated by a hair's breadth. The carnie laughed, and he heard a low whistle. A breeze swirled some loose bits of straw and cooled the light sweat on his back.
“Impressive,” she said, her voice rich and beautifully textured.
He shrugged. The carnie gathered the axes and offered them to him again. He nodded, not paying any attention to the man. “Wanna try it?” he asked the gazelle.
Her eyes were ice blue - he had expected them to be brown! - and long, dark lashes veiled them when she blinked. Her makeup was understated, but perfect - a dash of color and shadow. She cocked her head to one side, evaluating him, her lips curving slightly at the corners, the smile staying mostly in her eyes. She seemed to come to a decision and shrugged, then nodded. “Sure, why not? You make it look pretty easy!” She stepped up next to him and he yielded some space to allow her the center of the throwing lane. A couple of men in jeans and cowboy boots had stopped to watch, idly glancing from the target to him, then to her, their thumbs hooked in their belt loops. Their eyes lingered carefully on her, he could see, and they missed nothing. But she was his now. They would know better. The same way a jackal knew that the lion's food was not for him.
She held out her hand and said, “Kim.” He took it, smooth and warm, and nodded. “Dave.” It wasn't his name. Hell, hers probably wasn't “Kim”. He knew how this sort of thing went. If he'd been a normal man, at the end of the night she'd have written a fake phone number on his palm and made him promise to call. “Nice to meet you, Dave.” She smiled a little and held out a hand. He passed her one of the hatchets and she bounced it in her hand, holding it like a hammer. “Heavier than it looks!” she observed.
“Have you done this before?”
She shook her head. “Is there a trick to it?”
“Isn't there always?”
She laughed and shrugged, then concentrated. She drew back, holding it more like he had, concentrating with a small frown, and smoothly flung it down-range. It struck handle-first and fell to the floor. Boom-clang. “Shit.”
“It's your first try! Don't be so hard on yourself,” he said, offering her the second worn ax, handle first. She took it and grinned. He glanced around, noting that at least four men were watching her carefully now, along with Belt Pouch, who'd resumed his half-hearted patter about trying your luck and winning prizes, but was watching the couple with interest. Another breeze stirred some loose straw and made her hair flutter a bit as she turned and set her feet. She scuffed a foot on the asphalt. “I'd probably do better with sneakers or boots. High heels aren't really ideal for this sort of thing, I bet.” Concentrating, she drew back the ax, holding it almost exactly as he had, and smoothly tossing it downrange, where it stuck. Not in the bullseye, but on the target.
“Not too shabby!” He nodded approvingly, offering her the last ax. She flashed him a grin and took it, shifting her stance and her grip, then in one smooth motion, the ax sailed smoothly to the target and stuck, on the very edge of the red circle, just outside the bullseye. “Nice!” he said, grinning.
“I guess you made it look too easy.” She leaned against the 4x4, looking at him speculatively. “Win something for me.” She grinned, white teeth with the slightest hint of irregularity shining in the LED light. “A teddy bear, or a beer hat, or, you know, something fair-appropriate. You can do it, right?”
He paused for a moment, regarding her. “Perhaps.” He glanced at the carnie and jerked his head, and the carnie correctly interpreted the motion and retrieved the axes and picked up the last five. “What can I win?”
“You've already got some points racked up, so one bullseye will get you anything on this shelf.” He indicated a shelf littered with various sorts of toys, stuffed animals, lighters, and the like.
“What about three bullseyes?”
“That's this shelf.” There was nothing obviously different about the two shelves except the “points” on the label, and the fact that it was the highest one, but he nodded. He turned back downrange and tossed all three in a smooth, mechanical sequence, and they once again lined up on the bullseye, thunk-thunk-thunk. The carnie looked at him, his gaze unreadable, and pointed at the highest shelf. “What can I get you?”
Dave glanced at the gazelle. “Kim? Choose your prize.” He grinned.
Her eyes flashed a grin in return and she stepped up to the rail, pointing. “That, right there.” It wasn't a teddy bear. It was a cheap ripoff of a Zippo lighter with a praying mantis enameled onto the front in green, yellow, and black. The carnie shrugged and plucked it from the shelf and deposited it in her hand. She weighed it in her palm and flicked it open and closed a few times.
“It won't work,” the carnie said. “No fluid in it. You'll have to load it up when you get home.”
She nodded and turned back to Dave. “Wanna get a beer?”
He nodded. “Sure. Just one, though. I'm driving.” Together they threaded through the crowd to a place that had beer signs on posts. He noted the eyes of strangers on her as they made their way, and he grinned to himself. There was lust and jealousy and frustration in the eyes of the men. She really was quite attractive. A couple of women looked irritated, the way women sometimes do when a beautiful woman draws the attention of a man they feel belongs to them.
The “bar” was a roped off area set with high bar tables and stools, looking over a broad bit of straw-strewn ground where someone had erected a mechanical bull. It was surrounded with layers of foam pads a couple of inches thick, laid out so that the drunks tossed from the bull's back wouldn't end up traumatized in the emergency room, or worse. A couple of huge, slowly turning fans created a constant moderate breeze that felt good in the humid night air. Her hair fluttered as she hooked a foot into a stool and swung up onto the stool, to put her elbows down on the round tabletop, which was a mosaic of beer bottle caps entombed in some scuffed, clear plastic resin. Napkins, ketchup, mustard, and other condiments inhabited a little rack, along with salt and pepper packets. A waitress materialized at his elbow and mumbled something that ended in “... getcha?” He could smell a fryer and the aromas of bar food. Hot wings, french fries, hamburgers, nachos. He wasn't interested in that sort of thing, though.
Kim glanced at the beer menu clipped in a metal ring on the condiment carrier and tapped one - a mass market IPA. He held up two fingers, the waitress said, “Got it” and turned away. He hadn't wanted any food but was momentarily irritated that the mousy, pale woman hadn't asked him or his date. Kim grinned at him as though she could read his thoughts. One manicured finger tapped the table top and she cocked her head to one side again. “So, Dave, what do you do?”
He crossed his arms and met her gaze. What was the right answer for this one? Hard working laborer, or executive out to play? Salesman, computer nerd, actor? “Guess,” he finally said. “What do you think I do?”
“Go to county fairs to meet women.” Her reply was immediate, as though she'd known what he was going to say. “Professional ax thrower. Maybe you're secretly a carnie on a night off?”
She wore a tiny cross on a chain and a pair of stud earrings that were just bright golden spheres against her earlobes. He decided she wasn't the sort to see herself as a gold digger and shrugged. “I work in a warehouse. Drive a forklift.”
“A workin' man, eh? Union? I hear forklift driving is a decent gig if it's a union job.”
“Decent enough.” He shrugged. “Paid for my truck, keeps me in meals. It isn't for everyone, but I like it.” He shifted on the stool. “You?”
Just then the waitress returned with two bottles on a tray. “Ten dollars.” He took the bottles and dropped a ten and a five on the tray and she vanished without a word. Kim took a sip from the condensation-shrouded bottle and said, “I do books. Taxes, accounting, that sort of thing. Got a four-year degree in accounting and left for the big city - Lebanon, Missouri. It pays the bills.”
“Ever think about getting out?” he asked. “Heading for the big city? New York, Paris, you know. Bright lights and parties?” It was the question every rural and small town dweller asked themselves at some point. Cities were too dangerous for his kind, of course, but he knew how these people thought.
“Nah, not much. Nothing there for me. I have friends and family here.”
“That why you're here alone?”
“My sister's car broke down. Shit happens. And I'm not alone, right?”
He shrugged, nodded, then took a sip of cold, bitter, hoppy beer. “Why'd you pick that thing?” he asked, suddenly, pointing at the cheap zippo ripoff.
She shrugged. “I've just always loved praying mantises. They seem intelligent. They turn their head to watch you, and sometimes they'll dance with you.” She turned the lighter so the mantis was up, and opened the top. “They're related to walking sticks. There's one in Indonesia that looks like an orchid; it's evolved to pretend to be an orchid until the food gets close to what it sees as a flower. It's gorgeous, the same pastel colors as the orchids it sits on, all pinks and blues and purples.” She shut the lighter suddenly. “Then snap, the mantis moves like lightning and … dinner!”
“Sounds dangerous!” He grinned and tossed back most of the beer. He could feel her relaxing, the darkness that drove him, a burning hunger in his chest. His skin felt like it was rippling with electricity and he could smell her, delicate, rich, delicious. For a moment he saw a glowing outline around her as his hunger grew. He set the bottle back down and waved away the waitress as she stepped forward to see if he wanted more. Kim took a long pull on hers and tossed the almost empty bottle into the trash bin a few feet away.
“Let's go wander around, see what there is to see.” She slid off the stool and stretched fetchingly, her tiny purse bouncing against her trim belly. They slipped out of the roped-off bar area into the crowd. They watched a few drunks get dumped off the mechanical bull and laughed. She refused his challenge to get on it, and he refused hers in turn. They wandered through the fair, watching people and talking. She put her hand on his arm and pointed. “Let's go get our fortunes read!” There was a squared-off trailer with a sign that said “Tarot, fortunes told, palms read, loved ones contacted.” It was painted lots of colors and there was a small sign over the door that said “Entrance”. There were the ubiquitous straw bales delineating a small courtyard with eight chairs - all empty except one, inhabited by a tall, slender woman with frosted blond hair and dark eyes. She was cajoling passers by with promises of answers about love, life, and the future. When they turned into the tiny “courtyard” of hay bales, the woman stepped in front of them, holding up her hands. She shook her head. “This is not for you,” she said, eyes hooded and giving up nothing. “We don't need your money.” He thought for a moment he saw a glow in her eyes, and there were definitely faint glowing outlines around the door.
“What was that all about?” Kim asked, looking back over her shoulder, her voice betraying some mild irritation. “This is not for you!” she mimicked the woman's voice derisively. “What did I do? Do you know them? I've never seen them before.”
He shook his head. “I don't know them.” But he did. He knew her kind. The darkness inside him gave her a name. Witch. But Kim wouldn't find that amusing if spoken aloud. He glanced back and saw the woman making a hand gesture at their backs, her thumb clamped between her index and middle finger. It couldn't hurt him, but witches had been known to have … helpers; helpers who could hurt him.
“Eh, it’s just as well,” she said. “It’s getting late, I’m tired, and I should probably head home.”
“So soon?” He let disappointment creep into his voice. “Can I see you again?” It was the game, and he played it well.
She smiled and shook her head. “This wasn’t that kind of date, Dave. You know it, I know it. I won’t even write a fake number on your hand and implore you to call me sometime.”
He looked at her, a bit surprised. “You too good for a forklift driver?”
Her eyebrow rose and her blue eyes sparkled. “No, of course not. I just have a policy about men I meet alone at the fair. I know why you came here alone.”
He shrugged. It wouldn’t matter anyway. He’d catch her in the parking lot and it wouldn’t make any difference at all. She couldn’t get away; he’d already chosen her. He would just miss that delicious moment where the prey, having surrendered her trust, would suddenly recognize the error she had made and comfort would turn to terror, her heart leaping in fear and hammering against her ribs, her eyes going wide and her breasts heaving, nipples erect with fear and adrenaline as he forced her down with hands too strong for his size and build. This one would already be frightened when he got his hands on her. It would still be delicious, though. She might survive. Some did, empty husks, devoid of everything that makes life rich and beautiful, empty of life, of love, soulless in a sense. Most did not survive, giving up the spark of life along with the flame that he took.
She must have seen something in his eyes, then, because she looked a little uncomfortable. She waved her hand at him and started towards the gate, walking quickly without looking back. He could feel the tension in her body, the fear. She was already telling herself she was being ridiculous, though, telling herself that he wasn’t a danger to her. She turned the corner around a food cart onto one of the fair’s rows of games and shops and walked out of sight, carefully not looking back. He knew that she’d glance over her shoulder as soon as she thought she was out of sight. He moved, quickly, but not running, not drawing undue attention. He slipped between a couple of trailers and stepped over the mobile rail that marked off the fair from the fields around it and moved out of the light. Then he ran, his feet light, his heart beating, the thrill of the hunt coursing through his veins and the darkness within him crying out a wordless “YES”.
He rounded a large red shipping container that marked the edge of the parking lot and slipped in between the rows of trucks and SUVs. There weren’t many people there, but there was Kim, walking from the gate and almost trotting towards her Prius, glancing back over her shoulder furtively. He took a deep breath and could smell her rich scent, now tinged with fear and exertion, making his skin tingle and buzz with energy. He ducked low and paced along silently, just behind the row of cars where her Prius sat waiting. She gained the car and he was mere feet from her when he stood and said, “Hi.”
She yelped, a sharp, bright sound, and bolted, sprinting between the cars and out towards the open field and the woods a hundred feet beyond. He laughed and didn’t even care that two cops had heard her and were running after him. He trotted lightly after her, wanting her to make it to the trees before he caught her, but the cops were faster than her and were gaining on him. “On the ground!” one shouted, and drew a taser. Dave juked to one side, turned suddenly, faster than humanly possible, and drove a rigid hand into the neck of the pursuing cop, crushing his trachea and driving a shockwave into his spine. The cop was unconscious before he hit the ground, and Dave was ducking and rolling toward the other cop, who hadn’t quite realized what had happened. The second cop got his gun out but Dave had his hand on it before it cleared the holster, and he stripped it away, taking some of the cop’s hand with it and silencing the man’s sudden shout of pain with another vicious blow to the throat, the butt of the pistol crushing through cartilage and driving a vertebra so far out of alignment with the rest of his spine it severed the cord, the magic string, and he fell to the ground like roast and potatoes spilled from a platter. The world fell silent again except for the sound of her running feet, getting close to the trees.
His blood was singing and the darkness in him filled him to bursting, rendering the night in sharp relief, enabling him to see in this blackness as well as he could during the day. He could see her in the trees, a glowing body of beauty and heat and life, scrambling between the trees and trying to put distance between them. He moved silently, but fast, too fast for a human, for he was not merely human, not human at all. He was a predator, a hunter, and she was his meat. He was not a vampire, nor an incubus, but those legends might have originated with tales of creatures like him, creatures of darkness and stealth that lived on the delicious life of the prey they had hunted through the ages.
He drew even with her, silent in the darkness, and he could see her as though it were noon. Her eyes were wide and staring, rolling back and forth - he knew she couldn’t see him at all. He stepped down hard to break a twig and she froze at the sudden snap, staring around, trying to keep from breathing too loudly. She crept forward, trying to be quiet, trying to escape, without knowing it was already far too late. He stepped close to her and touched her neck with a gentle finger. Her entire body spasmed and she made a quiet, breathless whimpering sound, lunging away from his touch. He could see her trying to produce the scream trapped in her mind, but terror stole her breath and all that escaped was a croaking sound. He stepped close and ripped her tank top from her in a single move, exposing her body to his vision and his alone. She covered her breasts and whimpered, backing away from where she thought he was. He took two steps and ran his hand down her torso, gently, caressing, and she thrashed again and let out a little shout. He grabbed her by the throat, lifted her, and slammed her to the ground, driving the air from her lungs, and lay on her, his face close to her ear. “No screaming!” he said, quietly, and she turned her head away and tried to push him away, ineffectual and weak. He held her down by the slender throat and clawed her shorts off with the other hand and she sobbed, trying to cover herself. He grabbed one of her wrists in each hand and spread them as far apart as he could, forcing his knees between her legs and pressing her body down with his torso. She tried to buck but it didn’t matter, it didn’t move him. Not him. She was his prey, and he was here to consume her, not to be pushed away.
He looked into her wide, staring eyes, and thrust himself inside her. Or tried. Something was wrong - he’d missed some bit of clothing… He’d encountered something hard, like she’d been wearing some kind of chastity belt or … what the fuck? He transferred both of her wrists to his right hand and held them above her head and started to reach down to investigate and at that moment, her legs lifted and snapped around him, strong, hard, crushing him to her, his hips locked into place by legs he should have been able to push away easily but instead held him like iron bands, urging him closer. And the thing he’d mistaken for a chastity belt opened - he felt it, oh shit oh shit oh shit - and took in what he’d tried to thrust into her and bit into it with sharp teeth like hypodermic needles - he felt the loss and the rush of blood and release of pressure - and an immense, empty cold began to flood into him at that junction between them, a vacuum that sucked out of him everything he was or had ever been and the darkness in him gibbered and capered in terror it had never before known. It was his turn to wrestle weakly and ineffectually to try and break the deadly embrace. Her arms, suddenly as strong as hydraulic presses, pulled easily from his grasp and embraced him, pulling him close to her, pressing him to her body, once soft and supple, now hard and glossy. The coldness and emptiness grew in him, emptying him, and in the eldritch vision the darkness granted him he saw it, in the darkness, the dark, chitinous, triumphant, enormous body just on the other side of the veil, disguised in this world as a pale, soft, attractive girl… exactly the kind of girl he sought out, he hunted, he consumed. His thoughts spun, fear gripping him, his arms flailing uselessly as the emptiness consumed everything that was him. Then, at last, there was final darkness, and he felt himself evaporating into it, and was no more.
There was silence for a moment in the trees, and everything was still and quiet. Something stirred, something pale and slender. His body was tossed aside, empty now of everything important, and the girl stood, naked but for lace-up wedge-heeled sandals, her body soft and supple again. Her clothing reappeared over her flesh as though it were extruded from another place, and her makeup restored itself, the smears and streaks fading back into perfect order. She smoothed the ribbed tank top, now clean again and free of leaves or litter, ran a slender hand through her hair, and started back towards her car.

6
requirements.txt Normal file
View File

@ -0,0 +1,6 @@
gradio>=3.50.0
PyYAML>=6.0
torch>=2.0.0
torchaudio>=2.0.0
numpy>=1.21.0
chatterbox-tts

21
sample-audiobook.txt Normal file
View File

@ -0,0 +1,21 @@
# The Importance of Text-to-Speech Technology
Text-to-speech (TTS) technology has become increasingly important in our digital world. It enables computers and other devices to convert written text into spoken words, making content more accessible to a wider audience.
## Applications of TTS
TTS has numerous applications across various fields. In education, it helps students with reading difficulties by allowing them to listen to text. For people with visual impairments, TTS serves as a crucial tool for accessing digital content.
Mobile devices use TTS for navigation instructions, allowing drivers to keep their eyes on the road. Voice assistants like Siri and Alexa rely on TTS to communicate with users, answering questions and providing information.
## Recent Advancements
Recent advancements in neural network-based TTS systems have dramatically improved the quality of synthesized speech. Modern TTS voices sound more natural and expressive than ever before, with proper intonation, rhythm, and emphasis.
Chatterbox TTS represents the cutting edge of this technology, offering highly realistic voice synthesis that can be customized for different speakers and styles. This makes it ideal for creating audiobooks, podcasts, and other spoken content with a personal touch.
## Future Directions
The future of TTS technology looks promising, with ongoing research focused on making synthesized voices even more natural and emotionally expressive. We can expect to see TTS systems that can adapt to different contexts, conveying appropriate emotions and speaking styles based on the content.
As TTS technology continues to evolve, it will play an increasingly important role in human-computer interaction, accessibility, and content consumption.
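
The Chatterbox API exercised by test.py later in this diff can turn a sample text like the one above into audio directly. The sketch below is a minimal illustration under that assumption: the speaker sample path is a placeholder, and a real audiobook run would normally split the text into chunks (per paragraph or line) rather than synthesizing the whole file in one call.
import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Pick the best available device (Apple-silicon MPS, else CPU), as test.py does.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = ChatterboxTTS.from_pretrained(device=device)

# Read the sample file; a real pipeline would chunk it before synthesis.
with open("sample-audiobook.txt", "r", encoding="utf-8") as f:
    text = f.read()

# "sample_voice.wav" is a hypothetical speaker sample; any short WAV should work.
wav = model.generate(text, audio_prompt_path="sample_voice.wav")
ta.save("sample-audiobook.wav", wav, model.sr)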

123
setup-windows.ps1 Normal file
View File

@ -0,0 +1,123 @@
#Requires -Version 5.1
<#!
Chatterbox TTS - Windows setup script
What it does:
- Creates a Python virtual environment in .venv (if missing)
- Upgrades pip
- Installs dependencies from backend/requirements.txt and requirements.txt
- Creates a default .env with sensible ports if not present
- Launches start_servers.py using the venv's Python
Usage:
- Right-click this file and "Run with PowerShell" OR from PowerShell:
./setup-windows.ps1
- Optional flags:
-NoInstall -> Skip installing dependencies (just start servers)
-NoStart -> Prepare env but do not start servers
Notes:
- You may need to allow script execution once:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
- Press Ctrl+C in the console to stop both servers.
!#>
param(
[switch]$NoInstall,
[switch]$NoStart
)
$ErrorActionPreference = 'Stop'
function Write-Info($msg) { Write-Host "[INFO] $msg" -ForegroundColor Cyan }
function Write-Ok($msg) { Write-Host "[ OK ] $msg" -ForegroundColor Green }
function Write-Warn($msg) { Write-Host "[WARN] $msg" -ForegroundColor Yellow }
function Write-Err($msg) { Write-Host "[FAIL] $msg" -ForegroundColor Red }
$root = Split-Path -Parent $MyInvocation.MyCommand.Path
Set-Location $root
$venvDir = Join-Path $root ".venv"
$venvPython = Join-Path $venvDir "Scripts/python.exe"
# 1) Ensure Python available
function Get-BasePython {
try {
$pyExe = (Get-Command py -ErrorAction SilentlyContinue)
if ($pyExe) { return 'py -3' }
} catch { }
try {
$pyExe = (Get-Command python -ErrorAction SilentlyContinue)
if ($pyExe) { return 'python' }
} catch { }
throw "Python not found. Please install Python 3.x and add it to PATH."
}
# 2) Create venv if missing
if (-not (Test-Path $venvPython)) {
Write-Info "Creating virtual environment in .venv"
$basePy = Get-BasePython
if ($basePy -eq 'py -3') {
& py -3 -m venv .venv
} else {
& python -m venv .venv
}
Write-Ok "Virtual environment created"
} else {
Write-Info "Using existing virtual environment: $venvDir"
}
if (-not (Test-Path $venvPython)) {
throw ".venv python not found at $venvPython"
}
# 3) Install dependencies
if (-not $NoInstall) {
Write-Info "Upgrading pip"
& $venvPython -m pip install --upgrade pip
# Backend requirements
$backendReq = Join-Path $root 'backend/requirements.txt'
if (Test-Path $backendReq) {
Write-Info "Installing backend requirements"
& $venvPython -m pip install -r $backendReq
} else {
Write-Warn "backend/requirements.txt not found"
}
# Root requirements (optional frontend / project libs)
$rootReq = Join-Path $root 'requirements.txt'
if (Test-Path $rootReq) {
Write-Info "Installing root requirements"
& $venvPython -m pip install -r $rootReq
} else {
Write-Warn "requirements.txt not found at repo root"
}
Write-Ok "Dependency installation complete"
}
# 4) Ensure .env exists with sensible defaults
$envPath = Join-Path $root '.env'
if (-not (Test-Path $envPath)) {
Write-Info "Creating default .env"
@(
'BACKEND_PORT=8000',
'BACKEND_HOST=127.0.0.1',
'FRONTEND_PORT=8001',
'FRONTEND_HOST=127.0.0.1'
) -join "`n" | Out-File -FilePath $envPath -Encoding utf8 -Force
Write-Ok ".env created"
} else {
Write-Info ".env already exists; leaving as-is"
}
# 5) Start servers
if ($NoStart) {
Write-Info "-NoStart specified; setup complete. You can start later with:"
Write-Host " `"$venvPython`" `"$root\start_servers.py`"" -ForegroundColor Gray
exit 0
}
Write-Info "Starting servers via start_servers.py"
& $venvPython "$root/start_servers.py"

86
setup.py Normal file
View File

@ -0,0 +1,86 @@
#!/usr/bin/env python3
"""
Setup script for Chatterbox TTS Application
This script helps configure the application for different environments.
"""
import os
import shutil
from pathlib import Path
def setup_environment():
"""Setup environment configuration files"""
project_root = Path(__file__).parent
print("🔧 Setting up Chatterbox TTS Application...")
# Create .env file if it doesn't exist
env_file = project_root / ".env"
env_example = project_root / ".env.example"
if not env_file.exists() and env_example.exists():
print("📝 Creating .env file from .env.example...")
shutil.copy(env_example, env_file)
# Update PROJECT_ROOT in .env to current directory
with open(env_file, 'r') as f:
content = f.read()
content = content.replace('/path/to/your/chatterbox-ui', str(project_root))
with open(env_file, 'w') as f:
f.write(content)
print(f"✅ Created .env file with PROJECT_ROOT set to: {project_root}")
else:
print(" .env file already exists")
# Setup backend .env
backend_env = project_root / "backend" / ".env"
backend_env_example = project_root / "backend" / ".env.example"
if not backend_env.exists() and backend_env_example.exists():
print("📝 Creating backend/.env file...")
shutil.copy(backend_env_example, backend_env)
# Update PROJECT_ROOT in backend .env
with open(backend_env, 'r') as f:
content = f.read()
content = content.replace('/Users/stwhite/CODE/chatterbox-ui', str(project_root))
with open(backend_env, 'w') as f:
f.write(content)
print(f"✅ Created backend/.env file")
# Setup frontend .env
frontend_env = project_root / "frontend" / ".env"
frontend_env_example = project_root / "frontend" / ".env.example"
if not frontend_env.exists() and frontend_env_example.exists():
print("📝 Creating frontend/.env file...")
shutil.copy(frontend_env_example, frontend_env)
print(f"✅ Created frontend/.env file")
# Create necessary directories
directories = [
project_root / "speaker_data" / "speaker_samples",
project_root / "tts_temp_outputs",
project_root / "backend" / "tts_generated_dialogs"
]
for directory in directories:
directory.mkdir(parents=True, exist_ok=True)
print(f"📁 Created directory: {directory}")
print("\n🎉 Setup complete!")
print("\n📋 Next steps:")
print("1. Review and adjust the .env files as needed")
print("2. Install backend dependencies: cd backend && pip install -r requirements.txt")
print("3. Start backend server: cd backend && python start_server.py")
print("4. Start frontend server: cd frontend && python start_dev_server.py")
print("5. Open http://127.0.0.1:8001 in your browser")
if __name__ == "__main__":
setup_environment()
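
For reference, setup.py rewrites the PROJECT_ROOT placeholder in the copied .env to the actual checkout path, and setup-windows.ps1 writes the host/port defaults shown earlier. A root .env produced by these scripts would plausibly look like the following; the exact keys depend on your .env.example, and the path shown is an arbitrary example:
PROJECT_ROOT=/home/you/chatterbox-ui
BACKEND_HOST=127.0.0.1
BACKEND_PORT=8000
FRONTEND_HOST=127.0.0.1
FRONTEND_PORT=8001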

View File

@ -0,0 +1,21 @@
831c1dbe-c379-4d9f-868b-9798adc3c05d:
name: Adam
sample_path: speaker_samples/831c1dbe-c379-4d9f-868b-9798adc3c05d.wav
608903c4-b157-46c5-a0ea-4b25eb4b83b6:
name: Denise
sample_path: speaker_samples/608903c4-b157-46c5-a0ea-4b25eb4b83b6.wav
3c93c9df-86dc-4d67-ab55-8104b9301190:
name: Maria
sample_path: speaker_samples/3c93c9df-86dc-4d67-ab55-8104b9301190.wav
fb84ce1c-f32d-4df9-9673-2c64e9603133:
name: Debbie
sample_path: speaker_samples/fb84ce1c-f32d-4df9-9673-2c64e9603133.wav
90fcd672-ba84-441a-ac6c-0449a59653bd:
name: dummy_speaker
sample_path: speaker_samples/90fcd672-ba84-441a-ac6c-0449a59653bd.wav
a6387c23-4ca4-42b5-8aaf-5699dbabbdf0:
name: Mike
sample_path: speaker_samples/a6387c23-4ca4-42b5-8aaf-5699dbabbdf0.wav
6cf4d171-667d-4bc8-adbb-6d9b7c620cb8:
name: Minnie
sample_path: speaker_samples/6cf4d171-667d-4bc8-adbb-6d9b7c620cb8.wav

View File

@ -0,0 +1,36 @@
831c1dbe-c379-4d9f-868b-9798adc3c05d:
name: Adam
sample_path: speaker_samples/831c1dbe-c379-4d9f-868b-9798adc3c05d.wav
608903c4-b157-46c5-a0ea-4b25eb4b83b6:
name: Denise
sample_path: speaker_samples/608903c4-b157-46c5-a0ea-4b25eb4b83b6.wav
3c93c9df-86dc-4d67-ab55-8104b9301190:
name: Maria
sample_path: speaker_samples/3c93c9df-86dc-4d67-ab55-8104b9301190.wav
fb84ce1c-f32d-4df9-9673-2c64e9603133:
name: Debbie
sample_path: speaker_samples/fb84ce1c-f32d-4df9-9673-2c64e9603133.wav
90fcd672-ba84-441a-ac6c-0449a59653bd:
name: dummy_speaker
sample_path: speaker_samples/90fcd672-ba84-441a-ac6c-0449a59653bd.wav
a6387c23-4ca4-42b5-8aaf-5699dbabbdf0:
name: Mike
sample_path: speaker_samples/a6387c23-4ca4-42b5-8aaf-5699dbabbdf0.wav
6cf4d171-667d-4bc8-adbb-6d9b7c620cb8:
name: Minnie
sample_path: speaker_samples/6cf4d171-667d-4bc8-adbb-6d9b7c620cb8.wav
f1377dc6-aec5-42fc-bea7-98c0be49c48e:
name: Glinda
sample_path: speaker_samples/f1377dc6-aec5-42fc-bea7-98c0be49c48e.wav
dd3552d9-f4e8-49ed-9892-f9e67afcf23c:
name: emily
sample_path: speaker_samples/dd3552d9-f4e8-49ed-9892-f9e67afcf23c.wav
2cdd6d3d-c533-44bf-a5f6-cc83bd089d32:
name: Grace
sample_path: speaker_samples/2cdd6d3d-c533-44bf-a5f6-cc83bd089d32.wav
3d3e85db-3d67-4488-94b2-ffc189fbb287:
name: RCB
sample_path: speaker_samples/3d3e85db-3d67-4488-94b2-ffc189fbb287.wav
f754cf35-892c-49b6-822a-f2e37246623b:
name: Jim
sample_path: speaker_samples/f754cf35-892c-49b6-822a-f2e37246623b.wav

147
start_servers.py Executable file
View File

@ -0,0 +1,147 @@
#!/usr/bin/env python3
"""
Startup script that launches both the backend and frontend servers concurrently.
"""
import os
import sys
import time
import signal
import subprocess
import threading
from pathlib import Path
# Try to load environment variables, but don't fail if dotenv is not available
try:
from dotenv import load_dotenv
load_dotenv()
except ImportError:
print("python-dotenv not installed, using system environment variables only")
# Configuration
BACKEND_PORT = int(os.getenv("BACKEND_PORT", "8000"))
BACKEND_HOST = os.getenv("BACKEND_HOST", "0.0.0.0")
# Frontend host/port (for dev server binding)
FRONTEND_PORT = int(os.getenv("FRONTEND_PORT", "8001"))
FRONTEND_HOST = os.getenv("FRONTEND_HOST", "0.0.0.0")
# Export frontend host/port so backend CORS config can pick them up automatically
os.environ["FRONTEND_HOST"] = FRONTEND_HOST
os.environ["FRONTEND_PORT"] = str(FRONTEND_PORT)
# Get project root directory
PROJECT_ROOT = Path(__file__).parent.absolute()
def run_backend():
"""Run the backend FastAPI server"""
os.chdir(PROJECT_ROOT / "backend")
cmd = [
sys.executable,
"-m",
"uvicorn",
"app.main:app",
"--reload",
f"--host={BACKEND_HOST}",
f"--port={BACKEND_PORT}",
]
print(f"\n{'='*50}")
print(f"Starting Backend Server at http://{BACKEND_HOST}:{BACKEND_PORT}")
print(f"API docs available at http://{BACKEND_HOST}:{BACKEND_PORT}/docs")
print(f"{'='*50}\n")
return subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True,
bufsize=1,
)
def run_frontend():
"""Run the frontend development server"""
frontend_dir = PROJECT_ROOT / "frontend"
os.chdir(frontend_dir)
cmd = [sys.executable, "start_dev_server.py"]
env = os.environ.copy()
env["VITE_DEV_SERVER_HOST"] = FRONTEND_HOST
env["VITE_DEV_SERVER_PORT"] = str(FRONTEND_PORT)
print(f"\n{'='*50}")
print(f"Starting Frontend Server at http://{FRONTEND_HOST}:{FRONTEND_PORT}")
print(f"{'='*50}\n")
return subprocess.Popen(
cmd,
env=env,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True,
bufsize=1,
)
def print_process_output(process, prefix):
"""Print process output with a prefix"""
for line in iter(process.stdout.readline, ""):
if not line:
break
print(f"{prefix} | {line}", end="")
def main():
"""Main function to start both servers"""
print("\n🚀 Starting Chatterbox UI Development Environment")
# Start the backend server
backend_process = run_backend()
# Give the backend a moment to start
time.sleep(2)
# Start the frontend server
frontend_process = run_frontend()
# Create threads to monitor and print output
backend_monitor = threading.Thread(
target=print_process_output, args=(backend_process, "BACKEND"), daemon=True
)
frontend_monitor = threading.Thread(
target=print_process_output, args=(frontend_process, "FRONTEND"), daemon=True
)
backend_monitor.start()
frontend_monitor.start()
# Setup signal handling for graceful shutdown
def signal_handler(sig, frame):
print("\n\n🛑 Shutting down servers...")
backend_process.terminate()
frontend_process.terminate()
# Threads are daemon, so they'll exit when the main thread exits
print("✅ Servers stopped successfully")
sys.exit(0)
signal.signal(signal.SIGINT, signal_handler)
# Print access information
print("\n📋 Access Information:")
print(f" • Frontend: http://{FRONTEND_HOST}:{FRONTEND_PORT}")
print(f" • Backend API: http://{BACKEND_HOST}:{BACKEND_PORT}/api")
print(f" • API Documentation: http://{BACKEND_HOST}:{BACKEND_PORT}/docs")
print("\n⚠️ Press Ctrl+C to stop both servers\n")
# Keep the main process running
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
signal_handler(None, None)
if __name__ == "__main__":
main()
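
Because start_servers.py reads its configuration through os.getenv with defaults (8000 for the backend, 8001 for the frontend), the ports can be overridden per run without editing the script. A small sketch of launching it programmatically with overrides; the alternate port numbers are arbitrary examples:
import os
import subprocess
import sys

# Copy the current environment and override the defaults that
# start_servers.py reads via os.getenv before launching it.
env = os.environ.copy()
env["BACKEND_PORT"] = "9000"   # instead of the default 8000
env["FRONTEND_PORT"] = "9001"  # instead of the default 8001
subprocess.run([sys.executable, "start_servers.py"], env=env)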

110
storage_service.py Normal file
View File

@ -0,0 +1,110 @@
"""
Project storage service for saving and loading Chatterbox TTS projects.
"""
import json
import os
import asyncio
from pathlib import Path
from typing import List, Optional
from datetime import datetime
from models import DialogProject, DialogLine
class ProjectStorage:
"""Handles saving and loading projects to/from JSON files."""
def __init__(self, storage_dir: str = "projects"):
self.storage_dir = Path(storage_dir)
self.storage_dir.mkdir(exist_ok=True)
async def save_project(self, project: DialogProject) -> bool:
"""Save a project to a JSON file."""
try:
project_file = self.storage_dir / f"{project.id}.json"
# Convert to dict and ensure timestamps are strings
project_data = project.dict()
project_data["last_modified"] = datetime.now().isoformat()
# Ensure created_at is set if not already
if not project_data.get("created_at"):
project_data["created_at"] = datetime.now().isoformat()
with open(project_file, 'w', encoding='utf-8') as f:
json.dump(project_data, f, indent=2, ensure_ascii=False)
return True
except Exception as e:
print(f"Error saving project {project.id}: {e}")
return False
async def load_project(self, project_id: str) -> Optional[DialogProject]:
"""Load a project from a JSON file."""
try:
project_file = self.storage_dir / f"{project_id}.json"
if not project_file.exists():
return None
with open(project_file, 'r', encoding='utf-8') as f:
project_data = json.load(f)
# Validate that audio files still exist
for line in project_data.get("lines", []):
if line.get("audio_url"):
audio_path = Path("dialog_output") / line["audio_url"].split("/")[-1]
if not audio_path.exists():
line["audio_url"] = None
line["status"] = "pending"
return DialogProject(**project_data)
except Exception as e:
print(f"Error loading project {project_id}: {e}")
return None
async def list_projects(self) -> List[dict]:
"""List all saved projects with metadata."""
projects = []
for project_file in self.storage_dir.glob("*.json"):
try:
with open(project_file, 'r', encoding='utf-8') as f:
project_data = json.load(f)
projects.append({
"id": project_data["id"],
"name": project_data["name"],
"created_at": project_data.get("created_at"),
"last_modified": project_data.get("last_modified"),
"line_count": len(project_data.get("lines", [])),
"has_audio": any(line.get("audio_url") for line in project_data.get("lines", []))
})
except Exception as e:
print(f"Error reading project file {project_file}: {e}")
continue
# Sort by last modified (most recent first)
projects.sort(key=lambda x: x.get("last_modified", ""), reverse=True)
return projects
async def delete_project(self, project_id: str) -> bool:
"""Delete a saved project."""
try:
project_file = self.storage_dir / f"{project_id}.json"
if project_file.exists():
project_file.unlink()
return True
return False
except Exception as e:
print(f"Error deleting project {project_id}: {e}")
return False
async def project_exists(self, project_id: str) -> bool:
"""Check if a project exists in storage."""
project_file = self.storage_dir / f"{project_id}.json"
return project_file.exists()
# Global storage instance
project_storage = ProjectStorage()
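
All ProjectStorage methods are coroutines, so callers need an event loop. A minimal usage sketch, assuming DialogProject can be constructed from the id, name, and lines fields the code above reads (its real definition lives in models.py):
import asyncio

from models import DialogProject
from storage_service import project_storage

async def demo():
    # Hypothetical project; field names mirror what save/load/list access.
    project = DialogProject(id="demo-1", name="Demo project", lines=[])
    await project_storage.save_project(project)    # writes projects/demo-1.json
    print(await project_storage.list_projects())   # most recently modified first
    loaded = await project_storage.load_project("demo-1")
    print(loaded)

asyncio.run(demo())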

51
test.py Normal file
View File

@ -0,0 +1,51 @@
import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Detect device (Mac with M1/M2/M3/M4)
device = "mps" if torch.backends.mps.is_available() else "cpu"
def safe_load_chatterbox_tts(device="mps"):
"""
Safely load ChatterboxTTS model with proper device mapping.
Handles cases where model was saved on CUDA but needs to be loaded on MPS/CPU.
"""
# Store original torch.load function
original_torch_load = torch.load
def patched_torch_load(f, map_location=None, **kwargs):
# If no map_location is specified and we're loading on non-CUDA device,
# map CUDA tensors to the target device
if map_location is None:
if device == "mps" and torch.backends.mps.is_available():
map_location = torch.device("mps")
elif device == "cpu" or not torch.cuda.is_available():
map_location = torch.device("cpu")
else:
map_location = torch.device(device)
return original_torch_load(f, map_location=map_location, **kwargs)
# Temporarily patch torch.load
torch.load = patched_torch_load
try:
# Load the model with the patched torch.load
model = ChatterboxTTS.from_pretrained(device=device)
return model
finally:
# Restore original torch.load
torch.load = original_torch_load
model = safe_load_chatterbox_tts(device=device)
text = "Today is the day. I want to move like a titan at dawn, sweat like a god forging lightning. No more excuses. From now on, my mornings will be temples of discipline. I am going to work out like the gods… every damn day."
# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(
text,
audio_prompt_path=AUDIO_PROMPT_PATH,
exaggeration=2.0,
cfg_weight=0.5
)
ta.save("test-2.wav", wav, model.sr)
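
The comment above implies the audio prompt is optional, so a built-in default voice should be used when none is supplied. A minimal sketch under that assumption, reusing the model loaded above:
# Assumes generate() falls back to a default voice when no
# audio_prompt_path is given (implied by the comment above).
wav_default = model.generate("Testing the default Chatterbox voice.")
ta.save("test-default.wav", wav_default, model.sr)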