# Chatterbox TTS Backend: Bounded Concurrency + File I/O Offload Plan
Date: 2025-08-14
Owner: Backend
Status: Proposed (ready to implement)
## Goals
- Increase GPU utilization and reduce wall-clock time for dialog generation.
- Keep model lifecycle stable (leveraging current `ModelManager`).
- Minimal-risk changes: no API shape changes to clients.
## Scope
- Implement bounded concurrency for per-line speech chunk generation within a single dialog request.
- Offload audio file writes to threads to overlap GPU compute and disk I/O.
- Add configuration knobs to tune concurrency.
## Current State (References)
- `backend/app/services/dialog_processor_service.py`
- `DialogProcessorService.process_dialog()` iterates items and awaits `tts_service.generate_speech(...)` sequentially (lines ~171–201).
- `backend/app/services/tts_service.py`
- `TTSService.generate_speech()` runs the TTS forward and calls `torchaudio.save(...)` on the event loop thread (blocking).
- `backend/app/services/model_manager.py`
- `ModelManager.using()` tracks active work; prevents idle eviction during requests.
- `backend/app/routers/dialog.py`
- `process_dialog_flow()` expects ordered `segment_files` and then concatenates; good to keep order stable.
## Design Overview
1) Bounded concurrency at dialog level
- Plan all output segments with a stable `segment_idx` (including speech chunks, silence, and reused audio).
- For speech chunks, schedule concurrent async tasks with a global semaphore set by config `TTS_MAX_CONCURRENCY` (start at 3–4).
- Await all tasks and collate results by `segment_idx` to preserve order.
2) File I/O offload
- Replace direct `torchaudio.save(...)` with `await asyncio.to_thread(torchaudio.save, ...)` in `TTSService.generate_speech()`.
- This lets the next GPU forward start while previous file writes happen on worker threads.
## Configuration
Add to `backend/app/config.py`:
- `TTS_MAX_CONCURRENCY: int` (default: `int(os.getenv("TTS_MAX_CONCURRENCY", "3"))`).
- Optional (future): `TTS_ENABLE_AMP_ON_CUDA: bool = True` to allow mixed precision on CUDA only.
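A minimal sketch of the addition, assuming the current module-level layout of `config.py` (env parsing details are illustrative):
```python
import os

# Max concurrent per-chunk TTS generations within a single dialog request.
TTS_MAX_CONCURRENCY: int = int(os.getenv("TTS_MAX_CONCURRENCY", "3"))

# Optional, future: enable float16 autocast on CUDA only (MPS/CPU untouched).
TTS_ENABLE_AMP_ON_CUDA: bool = os.getenv("TTS_ENABLE_AMP_ON_CUDA", "true").lower() in ("1", "true", "yes")
```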
## Implementation Steps
### A. Dialog-level concurrency
- File: `backend/app/services/dialog_processor_service.py`
- Function: `DialogProcessorService.process_dialog()`
1. Planning pass to assign indices
- Iterate `dialog_items` and build a list of `planned_segments` entries (see the sketch after these steps):
- For silence or reuse: immediately append a final result with assigned `segment_idx` and continue.
- For speech: split into `text_chunks`; for each chunk create a planned entry: `{ segment_idx, type: 'speech', speaker_id, text_chunk, abs_speaker_sample_path, tts_params }`.
- Increment `segment_idx` for every planned segment (speech chunk or silence/reuse) to preserve final order.
2. Concurrency setup
- Create `sem = asyncio.Semaphore(config.TTS_MAX_CONCURRENCY)`.
- For each planned speech segment, create a task with an inner wrapper:
```python
async def run_one(planned):
    async with sem:
        try:
            out_path = await self.tts_service.generate_speech(
                text=planned.text_chunk,
                speaker_sample_path=planned.abs_speaker_sample_path,
                output_filename_base=planned.filename_base,
                output_dir=dialog_temp_dir,
                exaggeration=planned.exaggeration,
                cfg_weight=planned.cfg_weight,
                temperature=planned.temperature,
            )
            return planned.segment_idx, {
                "type": "speech",
                "path": str(out_path),
                "speaker_id": planned.speaker_id,
                "text_chunk": planned.text_chunk,
            }
        except Exception as e:
            return planned.segment_idx, {
                "type": "error",
                "message": f"Error generating speech: {e}",
                "text_chunk": planned.text_chunk,
            }
```
- Schedule with `asyncio.create_task(run_one(p))` and collect tasks.
3. Await and collate
- `results_map = {}`; for each completed task, set `results_map[idx] = payload`.
- Merge: start with all previously final (silence/reuse/error) entries placed by `segment_idx`, then fill speech results by `segment_idx` into a single `segment_results` list sorted ascending by index.
- Keep `processing_log` entries for each planned segment (queued, started, finished, errors).
4. Return value unchanged
- Return `{"log": ..., "segment_files": segment_results, "temp_dir": str(dialog_temp_dir)}`. This maintains router and concatenator behavior.
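Putting steps 1–3 together, a minimal sketch of the flow, assuming a hypothetical `PlannedSegment` container, an injected `split_into_chunks()` helper, and the `run_one()` wrapper from step 2 (which closes over the shared semaphore):
```python
import asyncio
from dataclasses import dataclass

@dataclass
class PlannedSegment:  # hypothetical; step 2's run_one uses attribute access
    segment_idx: int
    speaker_id: str
    text_chunk: str

async def schedule_and_collate(dialog_items, run_one, split_into_chunks):
    # Planning pass (step 1): assign a stable segment_idx to every output segment.
    final_results: dict[int, dict] = {}  # silence/reuse, final at planning time
    planned: list[PlannedSegment] = []
    segment_idx = 0
    for item in dialog_items:
        if item["type"] in ("silence", "reuse"):
            final_results[segment_idx] = dict(item)
            segment_idx += 1
            continue
        for chunk in split_into_chunks(item["text"]):
            planned.append(PlannedSegment(segment_idx, item["speaker_id"], chunk))
            segment_idx += 1
    # Scheduling (step 2): run_one acquires the semaphore internally.
    tasks = [asyncio.create_task(run_one(p)) for p in planned]
    # Collation (step 3): fill speech results into the final entries by index.
    results_map = dict(final_results)
    for idx, payload in await asyncio.gather(*tasks):
        results_map[idx] = payload
    return [results_map[i] for i in sorted(results_map)]
```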
### B. Offload audio writes
- File: `backend/app/services/tts_service.py`
- Function: `TTSService.generate_speech()`
1. After obtaining the `wav` tensor, replace:
```python
# torchaudio.save(str(output_file_path), wav, self.model.sr)
```
with:
```python
await asyncio.to_thread(torchaudio.save, str(output_file_path), wav, self.model.sr)
```
- Keep the rest of the cleanup logic (delete `wav`, `gc.collect()`, cache emptying) unchanged.
2. Optional (CUDA-only AMP)
- If CUDA is used and `config.TTS_ENABLE_AMP_ON_CUDA` is True, wrap forward with AMP:
```python
with torch.cuda.amp.autocast(dtype=torch.float16):
    wav = self.model.generate(...)
```
- Leave MPS/CPU code path as-is.
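Combined, the relevant portion of `generate_speech()` could look roughly like the sketch below. The `model.generate` keyword argument is an assumption inferred from the parameters this plan passes around, not a confirmed signature:
```python
import asyncio

import torch
import torchaudio

async def _forward_and_save(self, text, speaker_sample_path, output_file_path):
    # Sketch of a TTSService method; the real one also handles filenames,
    # logging, and cleanup.
    if self.device == "cuda" and config.TTS_ENABLE_AMP_ON_CUDA:
        with torch.cuda.amp.autocast(dtype=torch.float16):
            wav = self.model.generate(text, audio_prompt_path=speaker_sample_path)
    else:
        wav = self.model.generate(text, audio_prompt_path=speaker_sample_path)
    # The blocking disk write moves to a worker thread, so the event loop can
    # dispatch the next segment while this file is being written.
    await asyncio.to_thread(torchaudio.save, str(output_file_path), wav, self.model.sr)
```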
## Error Handling & Ordering
- Every planned segment owns a unique `segment_idx`.
- On failure, insert an error record at that index; downstream concatenation already skips missing/nonexistent paths.
- Preserve exact output order expected by `routers/dialog.py::process_dialog_flow()`.
## Performance Expectations
- GPU utilization should increase from ~50% to 75–90% depending on dialog size and line lengths.
- Wall-clock reduction is workload-dependent; target 1.5–2.5x on multi-line dialogs.
## Metrics & Instrumentation
- Add timestamped log entries per segment: planned→queued→started→saved.
- Log effective concurrency (max in-flight), and cumulative GPU time if available.
- Optionally add a simple timing summary at end of `process_dialog()`.
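A lightweight way to log effective concurrency inside the step-2 wrapper; counter and helper names here are illustrative:
```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)
sem = asyncio.Semaphore(3)  # config.TTS_MAX_CONCURRENCY in the real code

# Counters shared by all run_one tasks (single event loop: no lock needed).
stats = {"in_flight": 0, "max_in_flight": 0}

async def run_one(planned, generate_segment):
    async with sem:
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
        started = time.monotonic()
        try:
            return await generate_segment(planned)  # hypothetical: the step-2 body
        finally:
            stats["in_flight"] -= 1
            logger.info("segment %d done in %.2fs (peak in-flight so far: %d)",
                        planned.segment_idx, time.monotonic() - started,
                        stats["max_in_flight"])
```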
## Testing Plan
1. Unit-ish
- Small dialog (3 speech lines, 1 silence). Ensure ordering is stable and files exist.
- Introduce an invalid speaker to verify error propagation doesn't break the rest.
2. Integration
- POST `/api/dialog/generate` with 20–50 mixed-length lines and a couple of silences.
- Validate: response OK, concatenated file exists, zip contains all generated speech segments, order preserved.
- Compare runtime vs. sequential baseline (before/after).
3. Stress/limits
- Long lines split into many chunks; verify no OOM with `TTS_MAX_CONCURRENCY`=3.
- Try `TTS_MAX_CONCURRENCY`=1 to simulate sequential; compare metrics.
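For the ordering check, a sketch of the unit-ish case; the `dialog_processor` fixture, the async test plugin, the item schema, and the `process_dialog()` call shape are all assumptions:
```python
from pathlib import Path

import pytest

@pytest.mark.asyncio
async def test_order_stable_with_mixed_segments(dialog_processor):
    # 3 speech lines and 1 silence, matching the unit-ish case above.
    items = [
        {"type": "speech", "speaker_id": "alice", "text": "Line one."},
        {"type": "speech", "speaker_id": "bob", "text": "Line two."},
        {"type": "silence", "duration": 0.5},
        {"type": "speech", "speaker_id": "alice", "text": "Line three."},
    ]
    result = await dialog_processor.process_dialog(items)  # call shape assumed
    types = [seg["type"] for seg in result["segment_files"]]
    assert types == ["speech", "speech", "silence", "speech"]
    assert all(Path(seg["path"]).exists()
               for seg in result["segment_files"] if seg["type"] == "speech")
```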
## Rollout & Config Defaults
- Default `TTS_MAX_CONCURRENCY=3`.
- Expose via environment variable; no client changes needed.
- If instability observed, set `TTS_MAX_CONCURRENCY=1` to revert to sequential behavior quickly.
## Risks & Mitigations
- OOM under high concurrency → Mitigate with low default, easy rollback, and chunking already in place.
- Disk I/O saturation → Offload to threads; if disk is a bottleneck, decrease concurrency.
- Model thread safety → We call `model.generate` concurrently only up to the semaphore cap; if the underlying library is not safe for concurrent forward passes, serialize the forwards while still overlapping them with file I/O (sketched below). Early logs will reveal whether this is needed.
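If serialization proves necessary, one hedged shape for it; the lock name and `generate` call signature are illustrative:
```python
import asyncio

import torchaudio

# One forward at a time; file writes stay outside the lock so a finished
# segment can be written to disk while the next forward runs.
_forward_lock = asyncio.Lock()

async def generate_one(model, text, speaker_sample_path, output_file_path):
    async with _forward_lock:
        wav = await asyncio.to_thread(
            model.generate, text, audio_prompt_path=speaker_sample_path)
    await asyncio.to_thread(torchaudio.save, str(output_file_path), wav, model.sr)
```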
## Follow-up (Out of Scope for this change)
- Dynamic batching queue inside `TTSService` for further GPU efficiency.
- CUDA AMP enablement and profiling.
- Per-speaker sub-queues if batching requires same-speaker inputs.
## Acceptance Criteria
- `TTS_MAX_CONCURRENCY` is configurable; default=3.
- File writes occur via `asyncio.to_thread`.
- Order of `segment_files` unchanged relative to sequential output.
- End-to-end works for both small and large dialogs; error cases logged.
- Observed GPU utilization and runtime improve on representative dialog.