Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95

Draft
SakshiKekre wants to merge 3 commits into feat/model-backend-selector from feat/topic-gate

Conversation

@SakshiKekre (Collaborator) commented Jun 1, 2026

Two small, additive changes to the chat backend. Both are perf/UX fixes on the request critical path; both visible together in the same preview deploy.

1. Opt-in Haiku topic gate to short-circuit off-topic messages

Today the chat answers anything. "What's the capital of France?" returns Paris and then pivots to UK policy, spending the full system prompt + reference doc on input tokens, plus output tokens for the apology. Every off-topic message does this.

Rate limiting (PR #48) caps request volume; iteration capping (PR #87) bounds runaway loops. Neither prevents off-topic acceptance.

This PR adds a pre-step that classifies the last user message with a single Haiku call (~$0.001) and short-circuits with a canned SSE refusal if it's clearly off-topic.
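
A minimal sketch of what the canned refusal could look like as a stream of SSE frames, assuming the route hands a generator of strings to a streaming response. The event names (`text`, `done`), the payload keys, and the refusal wording are illustrative assumptions, not the shapes actually used in `backend/routes/chatbot.py`.

```python
import json
from typing import Iterator

# Hypothetical wording; the real canned message lives in the PR's code.
REFUSAL_TEXT = "I can only help with tax and benefit policy questions."


def sse_frame(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def canned_refusal_stream() -> Iterator[str]:
    """Yield a complete SSE response without entering the main chat loop."""
    yield sse_frame("text", {"content": REFUSAL_TEXT})
    # Assumed flag name on the final event so clients can tell a gate
    # refusal apart from a normal completion.
    yield sse_frame("done", {"refused_by_topic_gate": True})
```

Because the generator never touches the system prompt, reference doc, or tools, the only model cost on this path is the classification call itself.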

Off by default. Opt-in via:
```
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional
```
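
For illustration, the two variables might be read along these lines. The helper names, the set of accepted truthy strings, and the assumption that the default model equals the value shown above are all hypothetical; the PR's own parsing may differ.

```python
import os
from typing import Mapping, Optional

# Assumed default, mirroring the example value in the opt-in snippet.
DEFAULT_GATE_MODEL = "claude-haiku-4-5"
# Assumed set of values that count as "on".
_TRUTHY = {"1", "true", "yes", "on"}


def gate_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the flag is explicitly set to a truthy value."""
    env = os.environ if env is None else env
    raw = env.get("POLICYENGINE_CHAT_TOPIC_GATE_ENABLED", "")
    return raw.strip().lower() in _TRUTHY


def gate_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the override model if set and non-empty, else the default."""
    env = os.environ if env is None else env
    override = env.get("POLICYENGINE_CHAT_TOPIC_GATE_MODEL", "").strip()
    return override or DEFAULT_GATE_MODEL
```

An unset or unrecognised value leaves the gate off, which matches the "off by default" intent.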

Calibration (boundary cases the classifier prompt is tuned for):

| Prompt | Classifier | Why |
|---|---|---|
| "Capital of France?" | reject | unambiguously off-topic |
| "What did the chancellor say yesterday?" | reject | news, not policy |
| "How will the PA reform affect inflation?" | let through | eval A4: main loop should explain microsim vs macro |
| "What's the EITC?" | let through | factual policy lookup |
| ambiguous / malformed reply | let through | fail-open by design |

False negatives (rejecting on-topic) are worse than false positives (accepting off-topic). The latter wastes a few cents; the former breaks the product.
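
The fail-open asymmetry can be sketched as follows. The `OFF_TOPIC` label, the function names, and the injected `classify` callable are assumptions made for the example (the real classifier call and reply format are not shown here), and letting empty input through is likewise an assumed reading of the empty-input shortcut.

```python
from typing import Callable


def should_reject(reply: str) -> bool:
    """Reject only on an exact, unambiguous off-topic verdict."""
    # Hypothetical label; anything else (including garbage) lets the message through.
    return reply.strip().upper() == "OFF_TOPIC"


def gate_rejects(message: str, classify: Callable[[str], str]) -> bool:
    """Return True only when the classifier clearly says the message is off-topic."""
    if not message.strip():
        return False  # nothing to classify; skip the model call entirely
    try:
        return should_reject(classify(message))
    except Exception:
        # Fail open: a timeout or API error must never block an on-topic question.
        return False
```

Every path except an exact verdict returns `False`, so the worst case of a misbehaving classifier is the pre-gate behaviour.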

2. Speed up /chat/backends from ~30-45s to <1s

Cold-container symptom: the backend-selector dropdown in the frontend was taking 30-45s to render after page load, while the chat input rendered fast. Root cause was `/chat/backends` paying for first-time imports of `policyengine_uk_compiled` + `policyengine_uk` (and `policyengine_us` on PR #54) inside `available_backends() → package_version()`.

Two small fixes:

  • `modal_app.py`: extend `_preload_engine` to also pre-import the Python backends (best-effort; failures non-fatal). Shifts the heavy OpenFisca import from request time to image build time.
  • `backend/model_backends.py`: memoise `available_backends()` output. The values don't change within a deploy, so `importlib.metadata.version()` only runs once per container.
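
A rough sketch of the best-effort pre-import in the first bullet, using only the standard library. The helper name is hypothetical, and how it is called from `_preload_engine` and wired into the Modal image is not shown.

```python
import importlib
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)

# Module names taken from the description above.
PYTHON_BACKENDS = ("policyengine_uk", "policyengine_us")


def preload_python_backends(names: Iterable[str] = PYTHON_BACKENDS) -> List[str]:
    """Import each backend once and return the names that loaded."""
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # Best-effort: a missing or broken package is logged, never raised.
            logger.warning("pre-import of %s skipped: %s", name, exc)
        else:
            loaded.append(name)
    return loaded
```

Catching `Exception` rather than only `ImportError` is deliberate in this sketch, since a heavy package can fail during import for reasons other than being absent.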

Combined: `/chat/backends` returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
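
One way to memoise the lookup with the standard library is sketched below. The distribution names and the `(name, version)` return shape are assumptions for illustration; the real `available_backends()` may return something richer.

```python
from functools import lru_cache
from importlib import metadata
from typing import Optional, Tuple

# Assumed distribution names; adjust to whatever the deploy installs.
BACKEND_DISTRIBUTIONS = ("policyengine-uk", "policyengine-us")


def package_version(dist_name: str) -> Optional[str]:
    """Return the installed version, or None when the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


@lru_cache(maxsize=1)
def available_backends() -> Tuple[Tuple[str, str], ...]:
    """Computed once per process; later calls return the cached tuple."""
    return tuple(
        (name, version)
        for name in BACKEND_DISTRIBUTIONS
        if (version := package_version(name)) is not None
    )
```

Returning an immutable tuple keeps callers from accidentally mutating the cached value, and `available_backends.cache_clear()` is available for tests that need a fresh lookup.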

Why combined

Both small (~30 lines each), both touch the chat-message critical path, both visible together in the same preview deploy when testing. Topic gate is the bigger feature; the warmup is the perf fix you'd want for any demo where someone watches the page load.

Files

  • `backend/routes/chatbot.py` (+93): topic gate helpers + early-return wire-up
  • `backend/tests/test_topic_gate.py` (+86): classifier parser tests, fail-open, end-to-end gate stub
  • `backend/model_backends.py` (+26/-8): memoised `available_backends()`
  • `modal_app.py` (+18/-2): Python-backend pre-import

Stacked on PR #51

Base = `feat/model-backend-selector`. Once #51 merges, this auto-rebases to main.

Test plan

  • Confirm `/chat/backends` returns quickly on warm preview container
  • Open the preview, watch for the backend dropdown to render fast on first load (cold container)
  • Flip `POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true` on the preview's Modal secret; ask the four boundary cases and confirm behaviour matches the calibration table
  • Flip the env var back off; confirm baseline behaviour returns

Not in scope

  • Tuning the classifier prompt against a held-out eval set — manual examples for now
  • Telemetry for how often the gate fires (the runner from PR #52, "Add eval harness scaffold: spec, scenarios, fixtures dir", could surface `refused_by_topic_gate` from the `done` event; separate work)
  • Modal `min_containers=1` keep-warm — overkill for previews

Today every message — including "what's the capital of France?" — hits
the full chat loop: system prompt, reference doc, tools, often several
iterations before Claude decides the question isn't on-topic. Each one
burns input + output tokens.

This adds a pre-step that runs the last user message through a single
small classification call (Haiku by default) and short-circuits with a
canned SSE refusal if it's clearly off-topic. Wired in /chat/message
after the billing check, before backend resolution.

Calibration choices (in the classifier's system prompt):
- Reject only when unambiguously not policy (capitals, sports, news,
  general advice).
- Let everything ambiguous through. Eval A4 ("how does this reform
  affect inflation?") is a deliberate let-through — the main loop's
  scope refusal is the right place to handle that, not a pre-filter.
- Any classification error fails open. Wasting a few cents is better
  than wrongly rejecting an on-topic question.

Gate is off by default. Opt-in via:
  POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
  POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5  # optional override

Tests cover parser behaviour, empty-input shortcut, error-path fail-open,
and a TestClient-level check that the gate produces the expected SSE
shape when on.

@vercel (bot) commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| policyengine-uk-chat | Ready | Preview, Comment | Jun 2, 2026 2:20pm |

@github-actions (bot) commented Jun 1, 2026

Beta preview is ready.

Two small fixes that together remove a 30-45s cold-start wait on the
backend-selector dropdown in the frontend.

1. modal_app.py: extend _preload_engine to also import policyengine_uk
   (and policyengine_us if installed). Best-effort — failures are
   non-fatal. Shifts the heavy OpenFisca import from request time to
   image build time.

2. model_backends.py: cache available_backends() output. The values
   don't change within a deploy, so importlib.metadata.version() —
   which can trigger the package import we're trying to avoid — only
   runs once per container.

Combined effect: /chat/backends returns in <100ms on a warm container
and ~1s on cold, vs 30-45s today.

@SakshiKekre changed the title from "Add opt-in Haiku topic gate to /chat/message" to "Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup)" on Jun 2, 2026

The backend warmup cut the cold-start /chat/backends time from
30-45s to ~12-15s (the Modal container cold start itself). During that
window the dropdown rendered nothing, which reads as "broken."

Now it shows a small spinner + "Loading engines…" until the fetch
resolves. Doesn't gate sending a message — UK compiled is the default
anyway, so a user who sends before the dropdown settles still gets the
right backend.