fix(api): retry agent lookup on WebSocket connect to prevent 404 race #910

Closed
nicseltzer wants to merge 18 commits into RightNow-AI:main from nicseltzer:fix/804-ws-agent-race
Conversation

@nicseltzer
Summary

  • When a client opens a WebSocket connection to an agent immediately after spawn, the WS upgrade request can arrive before the kernel finishes inserting the agent into the registry, causing a spurious 404.
  • Replaced the single-shot registry.get() check with a retry loop that polls every 100ms for up to 3 seconds before returning 404.
  • No new abstractions; the change is a targeted 14-line diff in crates/openfang-api/src/ws.rs.
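
For readers without the diff open, the retry described above can be sketched roughly as follows. This is a minimal synchronous illustration, not the code in ws.rs: the `Registry` type alias and `lookup_with_retry` are hypothetical names, and the real handler would presumably use an async sleep rather than blocking a thread.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the kernel's agent registry (agent id -> state).
type Registry = Arc<Mutex<HashMap<String, String>>>;

/// Polls `registry` for `agent_id` every `interval` until `timeout` elapses.
/// Returns the entry if it appears in time, or `None` (the caller would answer 404).
fn lookup_with_retry(
    registry: &Registry,
    agent_id: &str,
    interval: Duration,
    timeout: Duration,
) -> Option<String> {
    let deadline = Instant::now() + timeout;
    loop {
        // The lock is released at the end of this statement, so a concurrent
        // insert is never blocked while this thread sleeps.
        let found = registry.lock().unwrap().get(agent_id).cloned();
        if found.is_some() {
            return found;
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(HashMap::new()));

    // Simulate the race: the agent is inserted shortly after the lookup starts.
    let writer = Arc::clone(&registry);
    let spawn = thread::spawn(move || {
        thread::sleep(Duration::from_millis(250));
        writer
            .lock()
            .unwrap()
            .insert("agent-1".to_string(), "running".to_string());
    });

    // 100 ms poll interval and 3 s budget, as described in the summary.
    let found = lookup_with_retry(
        &registry,
        "agent-1",
        Duration::from_millis(100),
        Duration::from_secs(3),
    );
    spawn.join().unwrap();
    assert_eq!(found.as_deref(), Some("running"));
}
```

A lookup for an agent that never appears still returns `None` once the budget is spent, so genuine 404s are delayed by the timeout rather than suppressed.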

Test plan

  • cargo build --workspace --lib — passes
  • cargo clippy --workspace --all-targets -- -D warnings — zero warnings
  • Manual: start daemon, open WS immediately after agent create, verify connection succeeds instead of 404

Fixes #804

🤖 Generated with Claude Code

tytsxai and others added 18 commits March 29, 2026 08:58
Parse the mentions JSON array from Mattermost event data and set
was_mentioned metadata when the bot user ID is present, enabling
agents to distinguish direct mentions from background group traffic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The agent config reload logic was missing skills and mcp_servers from
the change detection, so edits to these fields in agent.toml weren't
being picked up when loading agents from SQLite.

Added both fields to the comparison to ensure proper hot-reload.
The skills and mcp_servers fields must be at the top level of the
agent.toml, not after [capabilities], due to TOML implicit table
ordering rules.
`MessageContent::text_length()` returned 0 for `ToolUse` blocks,
ignoring the tool name and JSON input arguments. This caused the
compactor's `estimate_token_count()` (which uses `text_length()`)
to massively undercount tokens when conversations contained tool
calls with large arguments (e.g. web_search results, page content).

The result: compaction never triggered despite the session exceeding
the context window, leading to "Token limit exceeded" errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
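
A rough sketch of the undercount fix described in this commit message, under stated assumptions: `ContentBlock` is a hypothetical, simplified stand-in for the real content type, the tool input is held as an already-serialized JSON string, and the four-bytes-per-token heuristic is assumed for illustration only.

```rust
// Hypothetical, simplified content type; the real type has more variants and fields.
enum ContentBlock {
    Text(String),
    ToolUse { name: String, input_json: String },
}

impl ContentBlock {
    /// Length in bytes of everything that ends up in the prompt.
    /// Tool calls count their name plus their serialized arguments
    /// instead of contributing 0.
    fn text_length(&self) -> usize {
        match self {
            ContentBlock::Text(text) => text.len(),
            ContentBlock::ToolUse { name, input_json } => name.len() + input_json.len(),
        }
    }
}

/// Assumed heuristic for illustration: roughly four bytes per token.
fn estimate_token_count(blocks: &[ContentBlock]) -> usize {
    blocks.iter().map(ContentBlock::text_length).sum::<usize>() / 4
}

fn main() {
    let blocks = vec![
        ContentBlock::Text("hello".to_string()),
        ContentBlock::ToolUse {
            name: "web_search".to_string(),
            input_json: "{\"q\":\"rust\"}".to_string(),
        },
    ];
    // The tool call now contributes to the estimate, so large arguments
    // can push the session over the compaction threshold.
    assert!(blocks[1].text_length() > 0);
    println!("estimated tokens: {}", estimate_token_count(&blocks));
}
```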
…lientTransportConfig

The struct is marked #[non_exhaustive] in rmcp, so struct expression
syntax is rejected by the compiler. Switch to Default + field assignment.

Confidence: high
Scope-risk: narrow
Extract is_silent_token() helper for case-insensitive [SILENT] detection.
Revert unrelated Cargo.lock and formatting changes. Add unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
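
One plausible shape for such a helper, as a sketch only: the commit names `is_silent_token()` and its case-insensitive behavior, but the whitespace trimming and the exact-match rule below are assumptions.

```rust
/// Returns true when the whole reply is the [SILENT] marker, ignoring ASCII case.
/// Trimming surrounding whitespace is an assumption made for this sketch.
fn is_silent_token(text: &str) -> bool {
    text.trim().eq_ignore_ascii_case("[SILENT]")
}

fn main() {
    assert!(is_silent_token("[SILENT]"));
    assert!(is_silent_token("  [silent]\n"));
    // A reply that merely contains the marker is not treated as silent here.
    assert!(!is_silent_token("[SILENT] but here is an answer"));
}
```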
When a Hand agent is respawned (on daemon restart or reactivation),
activate_hand() rebuilds the manifest entirely from HAND.toml, silently
discarding any tool_allowlist/tool_blocklist changes made via the API.
The API returned {"status":"ok"} for these updates, creating a broken
contract: changes appeared to succeed but were lost on the next restart.

Fix: capture the existing agent's tool filters before killing it at
respawn time, and reapply them to the freshly-built manifest. Empty
filters are treated as "no override" (not "block all") so hands with no
API-set filters continue to get the full tool list from HAND.toml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
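
The capture-and-reapply step, including the "empty means no override" rule, might look roughly like this. `ToolFilters`, `Manifest`, and `reapply_filters` are hypothetical names for illustration and do not mirror the actual kernel types.

```rust
// Hypothetical types for illustration; the real manifest carries many more fields.
#[derive(Clone, Debug, Default, PartialEq)]
struct ToolFilters {
    allowlist: Vec<String>,
    blocklist: Vec<String>,
}

#[derive(Clone, Debug, Default, PartialEq)]
struct Manifest {
    tool_allowlist: Vec<String>,
    tool_blocklist: Vec<String>,
}

/// Reapplies filters captured from the old agent onto a freshly built manifest.
/// An empty list means "no API override", so the HAND.toml value is kept.
fn reapply_filters(manifest: &mut Manifest, saved: &ToolFilters) {
    if !saved.allowlist.is_empty() {
        manifest.tool_allowlist = saved.allowlist.clone();
    }
    if !saved.blocklist.is_empty() {
        manifest.tool_blocklist = saved.blocklist.clone();
    }
}

fn main() {
    // Manifest as rebuilt from HAND.toml at respawn time.
    let mut manifest = Manifest {
        tool_allowlist: vec!["web_search".to_string(), "shell".to_string()],
        tool_blocklist: vec![],
    };
    // Filters captured from the running agent before it was killed:
    // only the blocklist was changed via the API.
    let saved = ToolFilters {
        allowlist: vec![],
        blocklist: vec!["shell".to_string()],
    };
    reapply_filters(&mut manifest, &saved);
    assert_eq!(manifest.tool_allowlist.len(), 2); // HAND.toml value kept
    assert_eq!(manifest.tool_blocklist, vec!["shell".to_string()]);
}
```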
Extends the API-patch preservation introduced in 81eb7cb to cover model
config fields. Previously, provider, model, and temperature changes made
via the agents API were lost every time a Hand was respawned (daemon
restart, crash recovery, or manual reactivation) because activate_hand()
rebuilt the AgentManifest entirely from the compile-time-embedded HAND.toml.

Changes:
- kernel.rs: single registry scan now captures both tool filters and model
  config before rebuild; existing_model_override carries provider/model/
  temperature and is reapplied after the manifest is built, only when the
  live values differ from what HAND.toml would produce.  system_prompt is
  intentionally excluded — it is assembled dynamically from HAND.toml plus
  settings context and must stay live.
- registry.rs: add update_temperature() for hot-patching sampling temp.
- routes.rs: expose temperature in list/get responses; add temperature field
  to PatchAgentConfigRequest and implement it in patch_agent_config with
  0.0–2.0 validation.
- index_body.html + agents.js: temperature input in the agent config tab.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
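
The 0.0–2.0 temperature check mentioned for patch_agent_config could be sketched as below. `validate_temperature` and its error text are hypothetical; only the accepted range comes from the commit message.

```rust
/// Accepts temperatures in the closed range 0.0..=2.0 and rejects everything
/// else, including NaN (which is never contained in a range).
fn validate_temperature(value: f64) -> Result<f64, String> {
    if (0.0..=2.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!("temperature must be between 0.0 and 2.0, got {value}"))
    }
}

fn main() {
    assert_eq!(validate_temperature(0.7), Ok(0.7));
    assert!(validate_temperature(2.5).is_err());
    assert!(validate_temperature(f64::NAN).is_err());
}
```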
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dashboard passwords were hashed with plain SHA256 (no salt), vulnerable
to rainbow tables and GPU brute force. Switch to Argon2id with random
per-hash salts. Breaking change: existing SHA256 hashes in config.toml
must be regenerated with `openfang auth hash-password`.
Addresses review feedback:
- Add `openfang auth hash-password` subcommand so users can generate
  Argon2id hashes after upgrading (the command referenced in docs).
- Emit a tracing::warn at daemon startup when auth is enabled but the
  password_hash is not in Argon2id format, so users know why login fails.
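
The startup check could be approximated as follows. This sketch relies on the fact that Argon2id hashes in PHC string format begin with `$argon2id$`; the function names and warning text are hypothetical, and the real code would emit the message through tracing::warn rather than return it.

```rust
/// PHC-formatted Argon2id hashes start with the "$argon2id$" identifier.
fn looks_like_argon2id(hash: &str) -> bool {
    hash.starts_with("$argon2id$")
}

/// Returns the warning to log at startup, if any (hypothetical helper).
fn startup_warning(auth_enabled: bool, password_hash: &str) -> Option<String> {
    if auth_enabled && !looks_like_argon2id(password_hash) {
        Some(
            "password_hash is not an Argon2id hash; regenerate it with \
             `openfang auth hash-password`"
                .to_string(),
        )
    } else {
        None
    }
}

fn main() {
    // A legacy unsalted SHA256 hex digest triggers the warning.
    let legacy = "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8";
    assert!(startup_warning(true, legacy).is_some());
    // No warning when auth is disabled or the prefix matches.
    assert!(startup_warning(false, legacy).is_none());
    assert!(startup_warning(true, "$argon2id$v=19$m=19456,t=2,p=1$c2FsdA$aGFzaA").is_none());
}
```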
Regenerated lockfile after cherry-picking 9 upstream PRs.
Fixed needless borrow in line.rs introduced by PR #877 formatting.

Confidence: high
Scope-risk: narrow
43 AGENTS.md files providing AI-readable documentation for every
significant directory. Each includes purpose, key files, subdirectory
links, agent working instructions, testing requirements, and dependency
listings. Linked via <!-- Parent: --> hierarchy.

Confidence: high
Scope-risk: narrow
When an agent is being spawned, the WS upgrade can arrive before
registry insertion completes. Retry the lookup for up to 3 seconds
before returning 404.

Fixes: #804
Confidence: high
Scope-risk: narrow
@jaberjaber23
Member

Reviewed and approved. The real code change is solid. However this PR has merge conflicts and ~3,000 lines of auto-generated AGENTS.md files inflating the diff. Please: (1) rebase onto current main, (2) remove the AGENTS.md files from this PR (submit those separately if desired). We will merge immediately after.

@nicseltzer
Author

Ope -- I didn't even mean for these to push upstream. My bad!

@nicseltzer nicseltzer closed this Mar 31, 2026
Development

Successfully merging this pull request may close these issues.

Agent chat produces no response with Ollama provider — WebSocket returns 404

7 participants