fix: deflake //rs/tests/networking:canister_http_flexible_test#9857
Open
basvandijk wants to merge 1 commit intomasterfrom
Open
fix: deflake //rs/tests/networking:canister_http_flexible_test#9857basvandijk wants to merge 1 commit intomasterfrom
basvandijk wants to merge 1 commit intomasterfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces flakiness in //rs/tests/networking:canister_http_flexible_test by making the test driver proactively recover when the expected ip6 filter OUTPUT nftables chain is missing due to a transient boot-time reload failure.
Changes:
- Extend
wait_for_orchestrator_fw_rule_onceto attemptsystemctl reload nftableswhennft list chain ip6 filter OUTPUTindicates the chain is missing. - Document the boot-time DNS race/root-cause directly next to the check to clarify why the reload is needed.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add systemd ordering dependencies to reload_nftables.service so that nftables config reloads wait for name resolution (nss-lookup.target and systemd-resolved.service) to be available. The orchestrator-generated nftables config contains the hostname "hostos" (resolved by the nss_icos NSS plugin). When the orchestrator writes the config early at boot, reload_nftables.path triggers a reload before NSS is ready, causing nft to fail with "Could not resolve hostname: Temporary failure in name resolution". Since nft is transactional, the entire ruleset is discarded. The orchestrator only re-writes the file when the config content changes, so the reload is never retried, leaving the firewall without orchestrator-managed rules. This also reverts the test driver workaround from the initial commit.
b55e9a0 to
6802b37
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root Cause
During GuestOS boot, the orchestrator writes the nftables configuration file to
/run/ic-node/nftables-ruleset/nftables.conf. The systemd path unitreload_nftables.pathdetects the file change and triggersreload_nftables.service, which flushes the ruleset and reloads nftables.The orchestrator-generated nftables config contains the hostname
hostosin rules like:This hostname is resolved at
nftload time via thenss_icosNSS plugin. However, early at boot,systemd-resolvedmay not yet be fully operational whenreload_nftables.serviceruns. This causesnftto fail with:Since
nftis transactional, the entire ruleset is discarded. The orchestrator only re-writes the config file when its content changes (not on every 10s check cycle), so the reload is never retried. This leaves theip6 filter OUTPUTchain missing entirely, causingwait_for_orchestrator_fw_rulein the test driver to time out after 60 seconds and the test setup to fail.This was observed consistently in all 3 flaky runs from the past week — all nodes had the same DNS resolution failure at boot.
Fix
Added
After=nss-lookup.target systemd-resolved.serviceandWants=nss-lookup.targettoreload_nftables.service. This ensures that when the path unit triggers a reload, systemd delays starting the service until name resolution is available, so thehostoshostname can be resolved by thenss_icosplugin.Verification
bazel test --test_output=errors //rs/tests/networking:canister_http_flexible_testpasses. All affected unit tests also pass.This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.