Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95
Draft
SakshiKekre wants to merge 3 commits into
Today every message — including "what's the capital of France?" — hits
the full chat loop: system prompt, reference doc, tools, often several
iterations before Claude decides the question isn't on-topic. Each one
burns input + output tokens.
This adds a pre-step that runs the last user message through a single
small classification call (Haiku by default) and short-circuits with a
canned SSE refusal if it's clearly off-topic. Wired in /chat/message
after the billing check, before backend resolution.
Calibration choices (in the classifier's system prompt):
- Reject only when unambiguously not policy (capitals, sports, news,
general advice).
- Let everything ambiguous through. Eval A4 ("how does this reform
affect inflation?") is a deliberate let-through — the main loop's
scope refusal is the right place to handle that, not a pre-filter.
- Any classification error fails open. Wasting a few cents is better
than wrongly rejecting an on-topic question.
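The calibration rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names, the `OFF_TOPIC` label, and the injected `classify` callable (standing in for the Haiku call) are all hypothetical, and treating empty input as a let-through is an assumption.

```python
from typing import Callable


def parse_verdict(raw: str) -> bool:
    """Return True (let through) unless the reply is exactly OFF_TOPIC.

    Anything ambiguous or unexpected counts as on-topic, so only an
    unambiguous verdict can reject a message.
    """
    return raw.strip().upper() != "OFF_TOPIC"


def is_on_topic(message: str, classify: Callable[[str], str]) -> bool:
    """Run the gate on one user message.

    `classify` is a hypothetical stand-in for the single small-model call.
    """
    if not message.strip():
        # Assumed empty-input shortcut: nothing to classify, skip the call.
        return True
    try:
        return parse_verdict(classify(message))
    except Exception:
        # Fail open: a classifier error must never block the user.
        return True
```

With this shape, a timeout or malformed reply from the classifier costs one full chat loop at worst, which matches the stated preference for wasted cents over wrongly rejected questions.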
Gate is off by default. Opt-in via:
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional override
Tests cover parser behaviour, empty-input shortcut, error-path fail-open,
and a TestClient-level check that the gate produces the expected SSE
shape when on.
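For readers unfamiliar with the framing, a canned SSE refusal might look something like the sketch below. Only the `data: ...` line followed by a blank line is standard SSE; the event payload keys (`type`, `content`), the `done` marker, and the refusal wording are assumptions for illustration and are not taken from this PR.

```python
import json
from typing import Iterator

# Hypothetical wording; the real canned message is not shown in this PR.
REFUSAL_TEXT = "I can only help with questions about tax and benefit policy."


def sse_event(payload: dict) -> str:
    """Serialise one payload as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"


def canned_refusal() -> Iterator[str]:
    """Yield the frames for a short-circuit refusal, then a terminator."""
    yield sse_event({"type": "text", "content": REFUSAL_TEXT})
    yield sse_event({"type": "done"})
```

A TestClient-level check could then assert on the number and shape of these frames without ever reaching the main chat loop.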
Two small fixes that together remove a 30-45s cold-start wait on the backend-selector dropdown in the frontend.
1. modal_app.py: extend _preload_engine to also import policyengine_uk (and policyengine_us if installed). Best-effort: failures are non-fatal. Shifts the heavy OpenFisca import from request time to image build time.
2. model_backends.py: cache available_backends() output. The values don't change within a deploy, so importlib.metadata.version() (which can trigger the package import we're trying to avoid) only runs once per container.
Combined effect: /chat/backends returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
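The caching fix could be sketched as below, using `functools.lru_cache` so the version lookups run once per process. This is an assumed shape only: the real `available_backends()` return type, the package list, and the `package_version` helper body are not shown in this PR.

```python
from functools import lru_cache
from importlib import metadata
from typing import Optional, Tuple

# Hypothetical candidate list, based on the packages named in this PR.
CANDIDATE_PACKAGES = (
    "policyengine_uk_compiled",
    "policyengine_uk",
    "policyengine_us",
)


def package_version(name: str) -> Optional[str]:
    """Return the installed version, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


@lru_cache(maxsize=1)
def available_backends() -> Tuple[Tuple[str, str], ...]:
    """Compute (name, version) pairs once; later calls hit the cache."""
    found = []
    for name in CANDIDATE_PACKAGES:
        version = package_version(name)
        if version is not None:
            found.append((name, version))
    return tuple(found)
```

Returning an immutable tuple keeps callers from mutating the cached value; a dict copied on each call would work equally well if the endpoint needs a mapping.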
The backend warmup brought the cold-start /chat/backends time down from 30-45s to ~12-15s (the Modal container cold start itself). During that window the dropdown rendered nothing, which reads as "broken." It now shows a small spinner + "Loading engines…" until the fetch resolves. This doesn't gate sending a message: UK compiled is the default anyway, so a user who sends before the dropdown settles still gets the right backend.
Two small, additive changes to the chat backend. Both are perf/UX fixes on the request critical path; both visible together in the same preview deploy.
1. Opt-in Haiku topic gate to short-circuit off-topic messages
Today the chat answers anything. "What's the capital of France?" returns Paris and then pivots to UK policy, burning the full system prompt + reference doc on input tokens, plus output tokens for the apology. Every off-topic message does this.
Rate limiting (PR #48) caps request volume; iteration capping (PR #87) bounds runaway loops. Neither prevents off-topic acceptance.
This PR adds a pre-step that classifies the last user message with a single Haiku call (~$0.001) and short-circuits with a canned SSE refusal if it's clearly off-topic.
Off by default. Opt-in via:
```
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional
```
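One way the two variables might be read is sketched here. The helper name `gate_settings` and the set of accepted truthy strings are assumptions; the PR only documents `true` as the opt-in value and `claude-haiku-4-5` as the example model.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_GATE_MODEL = "claude-haiku-4-5"
# Assumed truthy spellings; only "true" is documented in the PR.
TRUTHY = {"1", "true", "yes"}


def gate_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[bool, str]:
    """Return (enabled, model) for the topic gate; disabled when unset."""
    env = os.environ if env is None else env
    raw_flag = env.get("POLICYENGINE_CHAT_TOPIC_GATE_ENABLED", "")
    enabled = raw_flag.strip().lower() in TRUTHY
    model = env.get("POLICYENGINE_CHAT_TOPIC_GATE_MODEL") or DEFAULT_GATE_MODEL
    return enabled, model
```

Passing the environment as a parameter keeps the default-off behaviour easy to test without mutating `os.environ`.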
Calibration (boundary cases the classifier prompt is tuned for):
False negatives (rejecting on-topic) are worse than false positives (accepting off-topic). The latter wastes a few cents; the former breaks the product.
2. Speed up /chat/backends from ~30-45s to <1s
Cold-container symptom: the backend-selector dropdown in the frontend was taking 30-45s to render after page load, while the chat input rendered fast. Root cause was `/chat/backends` paying for first-time imports of `policyengine_uk_compiled` + `policyengine_uk` (and `policyengine_us` on PR #54) inside `available_backends() → package_version()`.
Two small fixes:
Combined: `/chat/backends` returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
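The warmup half of the change amounts to a best-effort import pass. The sketch below shows that pattern in isolation; the `preload` helper is hypothetical, and how it is hooked into the Modal image build is deliberately left out because the PR does not show it.

```python
import importlib
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def preload(modules: Iterable[str]) -> List[str]:
    """Import each module if possible and return the names that loaded.

    Failures are logged and skipped so a missing optional package
    (for example policyengine_us) never breaks startup.
    """
    loaded = []
    for name in modules:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except Exception as exc:  # best-effort: never fatal
            logger.warning("preload skipped %s: %s", name, exc)
    return loaded
```

Calling something like `preload(["policyengine_uk", "policyengine_us"])` ahead of the first request moves the import cost off the `/chat/backends` path.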
Why combined
Both small (~30 lines each), both touch the chat-message critical path, both visible together in the same preview deploy when testing. Topic gate is the bigger feature; the warmup is the perf fix you'd want for any demo where someone watches the page load.
Files
Stacked on PR #51
Base = `feat/model-backend-selector`. Once #51 merges, this auto-rebases to main.
Test plan
Not in scope