# Unload Model on Idle: Implementation Plan

## Goals

- Automatically unload large TTS model(s) when idle to reduce RAM/VRAM usage.
- Lazy-load on demand without breaking API semantics.
- Configurable timeout and safety controls.

## Requirements

- Config-driven idle timeout and poll interval.
- Thread-/async-safe across concurrent requests.
- No unload while an inference is in progress.
- Clear logs and metrics for load/unload events.

## Configuration

File: `backend/app/config.py`

- Add:
  - `MODEL_IDLE_TIMEOUT_SECONDS: int = 900` (0 disables eviction)
  - `MODEL_IDLE_CHECK_INTERVAL_SECONDS: int = 60`
  - `MODEL_EVICTION_ENABLED: bool = True`
- Bind each setting to the environment variable of the same name.

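The plan does not pin down how `config.py` should read these values; the project may well use pydantic settings. A dependency-free sketch of the same bindings using only the standard library:

```python
import os
from dataclasses import dataclass, field


def _env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.getenv(name, default))


def _env_bool(name: str, default: bool) -> bool:
    """Read a boolean setting; accepts 1/true/yes (case-insensitive)."""
    return os.getenv(name, str(default)).lower() in ("1", "true", "yes")


@dataclass(frozen=True)
class Settings:
    # 0 disables idle eviction entirely
    MODEL_IDLE_TIMEOUT_SECONDS: int = field(
        default_factory=lambda: _env_int("MODEL_IDLE_TIMEOUT_SECONDS", 900)
    )
    MODEL_IDLE_CHECK_INTERVAL_SECONDS: int = field(
        default_factory=lambda: _env_int("MODEL_IDLE_CHECK_INTERVAL_SECONDS", 60)
    )
    MODEL_EVICTION_ENABLED: bool = field(
        default_factory=lambda: _env_bool("MODEL_EVICTION_ENABLED", True)
    )


settings = Settings()
```

Using `default_factory` means the environment is read at instantiation time, which keeps tests that monkeypatch `os.environ` straightforward.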
## Design

### ModelManager (Singleton)

File: `backend/app/services/model_manager.py` (new)

- Responsibilities:
  - Manage the lifecycle (load/unload) of the TTS model/pipeline.
  - Provide `get()` that returns a ready model (lazy-loading if needed) and updates `last_used`.
  - Track the active request count to block eviction while it is > 0.
- Internals:
  - `self._model` (or components), `self._last_used: float`, `self._active: int`.
  - Locks: an `asyncio.Lock` for load/unload; a separate `asyncio.Lock` (or `asyncio.Semaphore`) for the counter.
  - Optional CUDA cleanup: `torch.cuda.empty_cache()` after unload.
- API:
  - `async def get(self) -> Model`: ensures loaded; bumps `last_used`.
  - `async def load(self)`: idempotent; guarded by lock.
  - `async def unload(self)`: only when `self._active == 0`; clears refs and caches.
  - `def touch(self)`: update `last_used`.
  - Context helper: `def using(self)`: returns an async context manager that increments/decrements `active` safely.

### Idle Reaper Task

Registration: FastAPI startup (e.g., in `backend/app/main.py`)

- Run a background task loop every `MODEL_IDLE_CHECK_INTERVAL_SECONDS` seconds:
  - If eviction is enabled, the timeout is > 0, the model is loaded, `active == 0`, and `now - last_used >= timeout`, call `unload()`.
- Handle cancellation on shutdown.

### API Integration

- Replace direct model access in endpoints with:

```python
manager = ModelManager.instance()
async with manager.using():
    model = await manager.get()
    # perform inference
```

- Optionally call `manager.touch()` at request start for non-inference paths that still need the model resident.

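The hand-rolled `_Ctx` helper class in the pseudocode can equivalently be written with `contextlib.asynccontextmanager`. A minimal, self-contained sketch of just the `using()` mechanism (counter fields only; model handling omitted):

```python
import asyncio
import contextlib
import time


class ModelManager:
    def __init__(self):
        self._active = 0
        self._last_used = time.time()
        self._counter_lock = asyncio.Lock()

    @contextlib.asynccontextmanager
    async def using(self):
        # Increment the active-request count so the reaper will not evict.
        async with self._counter_lock:
            self._active += 1
        try:
            yield self
        finally:
            # Always decrement, even if inference raised, and refresh last_used
            # so the idle clock starts when the request finishes.
            async with self._counter_lock:
                self._active = max(0, self._active - 1)
                self._last_used = time.time()


async def demo():
    m = ModelManager()
    async with m.using():
        assert m._active == 1  # eviction is blocked inside the block
    return m._active
```

Running `asyncio.run(demo())` returns `0`: the count is restored on exit, so the try/finally guarantees the reaper is unblocked even when an endpoint raises.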
## Pseudocode

```python
# services/model_manager.py
import asyncio
import time
from typing import Optional

from .config import settings


class ModelManager:
    _instance: Optional["ModelManager"] = None

    def __init__(self):
        self._model = None
        self._last_used = time.time()
        self._active = 0
        self._lock = asyncio.Lock()
        self._counter_lock = asyncio.Lock()

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    async def load(self):
        async with self._lock:
            if self._model is not None:
                return  # idempotent: already loaded
            # ... load model/pipeline here ...
            self._model = await load_pipeline()
            self._last_used = time.time()

    async def unload(self):
        async with self._lock:
            if self._model is None:
                return
            if self._active > 0:
                return  # safety: do not unload while in use
            # ... free resources ...
            self._model = None
            try:
                import torch
                torch.cuda.empty_cache()
            except Exception:
                pass  # torch not installed or no CUDA device

    async def get(self):
        if self._model is None:
            await self.load()
        self._last_used = time.time()
        return self._model

    def touch(self):
        self._last_used = time.time()

    async def _inc(self):
        async with self._counter_lock:
            self._active += 1

    async def _dec(self):
        async with self._counter_lock:
            self._active = max(0, self._active - 1)
            self._last_used = time.time()

    def last_used(self):
        return self._last_used

    def is_loaded(self):
        return self._model is not None

    def active(self):
        return self._active

    def using(self):
        manager = self

        class _Ctx:
            async def __aenter__(self):
                await manager._inc()
                return manager

            async def __aexit__(self, exc_type, exc, tb):
                await manager._dec()

        return _Ctx()


# main.py (startup)
import contextlib
import logging

logger = logging.getLogger(__name__)


@app.on_event("startup")
async def start_reaper():
    async def reaper():
        while True:
            try:
                await asyncio.sleep(settings.MODEL_IDLE_CHECK_INTERVAL_SECONDS)
                if not settings.MODEL_EVICTION_ENABLED:
                    continue
                timeout = settings.MODEL_IDLE_TIMEOUT_SECONDS
                if timeout <= 0:
                    continue
                m = ModelManager.instance()
                if m.is_loaded() and m.active() == 0 and (time.time() - m.last_used()) >= timeout:
                    await m.unload()
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.exception("Idle reaper error: %s", e)

    app.state._model_reaper_task = asyncio.create_task(reaper())


@app.on_event("shutdown")
async def stop_reaper():
    task = getattr(app.state, "_model_reaper_task", None)
    if task:
        task.cancel()
        # suppress(Exception) would miss CancelledError (a BaseException)
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

## Observability

- Logs: model load/unload, reaper decisions, active count.
- Metrics (optional): counters and gauges (load events, active requests, residency time).

## Safety & Edge Cases

- Never unload while `active > 0`.
- Guard loads/unloads with a lock so concurrent calls cannot race.
- Multi-worker servers: each worker process manages (and evicts) its own model copy.
- Cold-start latency: document the expected extra latency on the first request after an idle unload.

## Testing

- Unit tests for `ModelManager`: load/unload idempotency, counter behavior.
- Simulated reaper triggering with short timeouts.
- Endpoint tests: concurrency (N simultaneous inferences); ensure no unload mid-flight.

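The idempotency test can be sketched against a stub whose loader counts real loads (class and attribute names here are illustrative, not the project's):

```python
import asyncio


class StubManager:
    """Minimal stand-in for ModelManager with a counting loader."""

    def __init__(self):
        self._model = None
        self._lock = asyncio.Lock()
        self.load_calls = 0

    async def load(self):
        async with self._lock:
            if self._model is not None:
                return  # idempotent: repeated calls are no-ops
            self.load_calls += 1
            self._model = object()  # stands in for the real pipeline

    async def unload(self):
        async with self._lock:
            self._model = None


async def test_load_is_idempotent():
    m = StubManager()
    # Five concurrent loads must result in exactly one real load.
    await asyncio.gather(*(m.load() for _ in range(5)))
    assert m.load_calls == 1
    # After unload, the next get/load performs one more real load.
    await m.unload()
    await m.load()
    assert m.load_calls == 2


asyncio.run(test_load_is_idempotent())
```

`asyncio.gather` here exercises the lock-guarded double-check; without the `if self._model is not None` re-check inside the lock, the first assertion would fail.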
## Rollout Plan

1. Introduce config + manager (no reaper); switch endpoints to `using()`.
2. Enable the reaper with a long timeout in staging; observe logs/metrics.
3. Tune the timeout; enable in production.

## Tasks Checklist

- [ ] Add config flags and defaults in `backend/app/config.py`.
- [ ] Create `backend/app/services/model_manager.py`.
- [ ] Register the startup/shutdown reaper in app init (`backend/app/main.py`).
- [ ] Refactor endpoints to use `ModelManager.instance().using()` and `get()`.
- [ ] Add logs and optional metrics.
- [ ] Add unit/integration tests.
- [ ] Update README/ops docs.

## Alternatives Considered

- Gunicorn/uvicorn worker preloading with an external idle supervisor: more complexity, less portability.
- OS-level cgroup memory-pressure eviction: opaque and risky for correctness.

## Configuration Examples

```
MODEL_EVICTION_ENABLED=true
MODEL_IDLE_TIMEOUT_SECONDS=900
MODEL_IDLE_CHECK_INTERVAL_SECONDS=60
```