Skip to content

fix: deflake //rs/tests/networking:canister_http_flexible_test#9857

Open
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-canister_http_flexible_test-2026-04-14
Open

fix: deflake //rs/tests/networking:canister_http_flexible_test#9857
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-canister_http_flexible_test-2026-04-14

Conversation

@basvandijk
Copy link
Copy Markdown
Collaborator

@basvandijk basvandijk commented Apr 14, 2026

Root Cause

During GuestOS boot, the orchestrator writes the nftables configuration file to /run/ic-node/nftables-ruleset/nftables.conf. The systemd path unit reload_nftables.path detects the file change and triggers reload_nftables.service, which flushes the ruleset and reloads nftables.

The orchestrator-generated nftables config contains the hostname hostos in rules like:

ip6 saddr { hostos } ct state { new } tcp dport { 42372 } accept

This hostname is resolved at nft load time via the nss_icos NSS plugin. However, early at boot, systemd-resolved may not yet be fully operational when reload_nftables.service runs. This causes nft to fail with:

/etc/nftables.conf:95:17-22: Error: Could not resolve hostname: Temporary failure in name resolution

Since nft is transactional, the entire ruleset is discarded. The orchestrator only re-writes the config file when its content changes (not on every 10s check cycle), so the reload is never retried. This leaves the ip6 filter OUTPUT chain missing entirely, causing wait_for_orchestrator_fw_rule in the test driver to time out after 60 seconds and the test setup to fail.

This was observed consistently in all 3 flaky runs from the past week — all nodes had the same DNS resolution failure at boot.

Fix

Added After=nss-lookup.target systemd-resolved.service and Wants=nss-lookup.target to reload_nftables.service. This ensures that when the path unit triggers a reload, systemd delays starting the service until name resolution is available, so the hostos hostname can be resolved by the nss_icos plugin.

Verification

bazel test --test_output=errors //rs/tests/networking:canister_http_flexible_test passes. All affected unit tests also pass.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces flakiness in //rs/tests/networking:canister_http_flexible_test by making the test driver proactively recover when the expected ip6 filter OUTPUT nftables chain is missing due to a transient boot-time reload failure.

Changes:

  • Extend wait_for_orchestrator_fw_rule_once to attempt systemctl reload nftables when nft list chain ip6 filter OUTPUT indicates the chain is missing.
  • Document the boot-time DNS race/root-cause directly next to the check to clarify why the reload is needed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add systemd ordering dependencies to reload_nftables.service so that
nftables config reloads wait for name resolution (nss-lookup.target and
systemd-resolved.service) to be available.

The orchestrator-generated nftables config contains the hostname
"hostos" (resolved by the nss_icos NSS plugin). When the orchestrator
writes the config early at boot, reload_nftables.path triggers a reload
before NSS is ready, causing nft to fail with "Could not resolve
hostname: Temporary failure in name resolution". Since nft is
transactional, the entire ruleset is discarded. The orchestrator only
re-writes the file when the config content changes, so the reload is
never retried, leaving the firewall without orchestrator-managed rules.

This also reverts the test driver workaround from the initial commit.
@basvandijk basvandijk force-pushed the ai/deflake-canister_http_flexible_test-2026-04-14 branch from b55e9a0 to 6802b37 Compare April 14, 2026 13:32
@basvandijk basvandijk requested a review from Copilot April 14, 2026 13:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@basvandijk basvandijk marked this pull request as ready for review April 14, 2026 14:14
@basvandijk basvandijk requested a review from a team as a code owner April 14, 2026 14:14
@github-actions github-actions bot added the @node label Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants