Skip to content

Harden CI test network cleanup; sweep orphaned iptables rules#285

Merged
rgarcia merged 1 commit into
mainfrom
fix/ci-test-network-cleanup
Jun 10, 2026
Merged

Harden CI test network cleanup; sweep orphaned iptables rules#285
rgarcia merged 1 commit into
mainfrom
fix/ci-test-network-cleanup

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Problem

Random flaky failures in the Linux Test job on the self-hosted runner (dev-deft) — instance did not reach Running within 45s, image should be ready after 60 seconds — across multiple unrelated branches simultaneously, i.e. environmental, not code.

Root cause: the integration tests create per-bridge iptables FORWARD/NAT rules. The test-harness cleanup called iptables without -w and ignored all errors, so under xtables-lock contention from concurrent CI jobs (11 runners on one host) cleanup failed silently. Rules for deleted bridges then leaked permanently (cleanupStaleLinkDownRoutes only handles still-present linkdown routes) — ~880 dead-bridge rules had accumulated over 7 days of host uptime, inflating every iptables op ~10x and worsening the -w 5 lock-wait timeouts that trigger the boot failures.

Note: the iptables leak is a slow-burn aggravator; the primary acute driver is burst concurrency on the shared host. Capping Test-job concurrency is a separate follow-up.

Changes

  • Add -w 5 to all harness iptables calls (matches lib/network/bridge_linux.go).
  • Surface cleanup errors to stderr instead of swallowing; retry deletes on transient lock contention; treat already-gone routes/links as benign.
  • In-process orphan sweep (no systemd timer): once per test binary, under the existing subnet lock, delete test iptables rules whose bridge interface is gone — scoped via regex to the hm test prefix only, so a separate non-test hypeman process's ha rules are never touched.
  • Remove dead HYPEMAN_TEST_NETWORK_TMPDIR plumbing from test.yml (its Go reader was reverted in Remove mailbox for now #268; re-wiring would reintroduce per-run lock/lease isolation and break cross-run subnet coordination).

Testing

gofmt/vet/build pass. The integration tests require root+linux+network, so validating via CI on this branch (with a couple of re-runs to confirm the flakes subside). Host's accumulated orphans were already cleaned out-of-band.

🤖 Generated with Claude Code


Note

Medium Risk
Changes run privileged iptables/ip cleanup on shared CI hosts; scope is limited to hm test rules and error handling, but mistakes could affect host networking during tests.

Overview
Hardens Linux integration test network teardown so leaked iptables rules stop piling up on the shared self-hosted runner and slowing every job under xtables lock contention.

The test harness now runs iptables with -w 5 (same as production bridge_linux.go), logs real cleanup failures to stderr instead of ignoring them, retries deletes on lock contention, and treats already-removed routes/links as benign. On first subnet allocation per test binary, it sweeps orphaned hm* test-harness rules in FORWARD and nat/POSTROUTING when the referenced bridge no longer exists—without touching ha* production rules.

CI drops the unused HYPEMAN_TEST_NETWORK_TMPDIR setup, env var, and /tmp/hm-net-* cleanup from .github/workflows/test.yml.

Reviewed by Cursor Bugbot for commit 8989bc9. Bugbot is set up for automated code reviews on this repo. Configure here.

The Linux integration tests create per-bridge iptables FORWARD/NAT rules.
On the shared self-hosted runner these leak: the test-harness cleanup
called iptables without the -w lock-wait flag and ignored all errors, so
under xtables-lock contention from concurrent CI jobs cleanup failed
silently. Rules whose bridge interface no longer exists then accumulated
indefinitely (cleanupStaleLinkDownRoutes only handles still-present
linkdown routes), inflating every iptables operation over time and
contributing to flaky "instance did not reach Running within 45s"
timeouts.

- Add -w 5 to all harness iptables calls (matches lib/network convention)
- Surface cleanup errors to stderr instead of swallowing them; retry
  deletes on transient lock contention; treat already-gone routes/links
  as benign
- Sweep orphaned test iptables rules (interface gone) once per test
  binary under the existing subnet lock, scoped to the "hm" test prefix
  so a non-test hypeman process's "ha" rules are never touched
- Remove dead HYPEMAN_TEST_NETWORK_TMPDIR plumbing from test.yml (its Go
  reader was reverted in #268; re-wiring it would reintroduce per-run
  lock/lease isolation and break cross-run subnet coordination)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia requested a review from hiroTamada June 10, 2026 01:23
@rgarcia rgarcia merged commit 46331ed into main Jun 10, 2026
17 checks passed
@rgarcia rgarcia deleted the fix/ci-test-network-cleanup branch June 10, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants