fix(build): host musl cross-compile fails on macOS with ProcessFdQuotaExceeded (missing ulimit guard) #2112

Description

@purp

Problem Statement

On macOS, mise run gateway fails while cross-compiling the openshell-sandbox supervisor binary with the error: unable to search for static library …/libwant-*.rlib: ProcessFdQuotaExceeded. The static *-unknown-linux-musl link opens ~333 .rlib files at once, which exceeds macOS's default per-process open-file limit of 256 (ulimit -n). This blocks the most common local dev path (the docker driver, and the podman driver too) for any macOS contributor whose shell uses the OS default limit. The VM driver already guards against this; the docker/podman paths do not.

Technical Context

The supervisor image (deploy/docker/Dockerfile.supervisor) is FROM scratch and only COPYs a prebuilt openshell-sandbox binary — it never compiles Rust. So the binary must be cross-compiled on the host and staged into the build context first. On macOS that host build uses cargo zigbuild targeting *-unknown-linux-musl. That link step opens every dependency rlib simultaneously; with 333 rlibs and a soft limit of 256, the linker gets EMFILE, surfaced by zig as ProcessFdQuotaExceeded.
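The arithmetic behind the failure (about 333 rlibs against a soft limit of 256) can be checked on any host. The following is a rough sketch, not project code: both function names are hypothetical, and the deps directory you pass in depends on your own target layout.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the repo) for comparing the number of rlibs a
# static link would open against the shell's soft open-file limit.

# Print the number of .rlib files directly inside the given deps directory.
count_rlibs() {
  local deps_dir="$1"
  find "$deps_dir" -maxdepth 1 -type f -name '*.rlib' | wc -l | tr -d ' '
}

# Succeed when a process that opens `needed` files at once stays under the
# current soft limit; fail when it would hit EMFILE.
fits_in_fd_limit() {
  local needed="$1" soft
  soft="$(ulimit -Sn)"
  [ "$soft" = "unlimited" ] && return 0
  [ "$needed" -lt "$soft" ]
}

# Usage (the path is an assumption about your target dir layout):
#   fits_in_fd_limit "$(count_rlibs path/to/target/deps)" || echo "link will likely hit EMFILE"
```

The check is deliberately conservative: it ignores the handful of descriptors the linker already holds (stdio, output file), so a count close to the limit should be treated as a failure too.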

The project already knows about this failure mode: tasks/scripts/vm/build-supervisor-bundle.sh defines ensure_build_nofile_limit() which raises ulimit -n (to OPENSHELL_VM_BUILD_NOFILE_LIMIT, default 8192, min 1024) before its cargo zigbuild. But that guard exists only on the VM driver path. Every docker/podman host-staging path lacks it.

Separately, the comment at tasks/scripts/gateway-docker.sh:114-117 claims the cross-compile "happens inside Linux containers — sidestepping macOS's per-process file-descriptor cap." That is stale/incorrect and directly contradicts architecture/build.md:58-60 ("Neither Dockerfile compiles Rust — both copy a staged binary"). The compile happens on the host.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Host binary staging (chokepoint) | tasks/scripts/stage-prebuilt-binaries.sh, tasks/scripts/docker-build-image.sh | Cross-compiles openshell-sandbox/openshell-gateway on the host via cargo zigbuild and stages them into the Docker build context. Single point all docker/podman builds funnel through. |
| Docker gateway path | tasks/scripts/gateway.sh, tasks/scripts/gateway-docker.sh | mise run gateway entrypoint; detects driver and triggers the supervisor build. Carries the stale in-container comment. |
| Podman gateway path | tasks/scripts/gateway.sh (ensure_podman_supervisor_image) | Builds the supervisor sideload image via the same host-staging path on first run. |
| VM build (reference impl) | tasks/scripts/vm/build-supervisor-bundle.sh | Already contains the fd-limit guard: the pattern to lift into a shared helper. |
| Supervisor image | deploy/docker/Dockerfile.supervisor | scratch image; only COPYs the prebuilt host binary (never compiles). |

Technical Investigation

Architecture Overview

mise run gateway → gateway.sh:92-111 detect_driver() (order: in-cluster k8s → podman → docker) → dispatches to the driver path. On the docker path it execs gateway-docker.sh (gateway.sh:247). In the cross-compile branch (gateway-docker.sh:146, i.e. macOS or arch mismatch), it calls docker-build-image.sh supervisor-output (:157). docker-build-image.sh:164 runs ensure_prebuilt_binaries, which, when not in CI and PREBUILT_AUTO_STAGE != 0 (:77), runs stage-prebuilt-binaries.sh on the host (:82). For the musl supervisor target that selects cargo zigbuild (stage-prebuilt-binaries.sh:174-175) and invokes it at :208. The resulting binary is copied into the scratch supervisor image (Dockerfile.supervisor:22,33).

The podman path (gateway.sh:269-272 → ensure_podman_supervisor_image → :180 mise run build:docker:supervisor) funnels into the identical docker-build-image.sh → stage-prebuilt-binaries.sh host build.

Code References

| Location | Description |
| --- | --- |
| tasks/scripts/gateway.sh:92-111 | Driver detection (k8s → podman → docker) |
| tasks/scripts/gateway-docker.sh:114-117 | Stale comment claiming in-container cross-compile (incorrect) |
| tasks/scripts/gateway-docker.sh:146-157 | Cross-compile branch → docker-build-image.sh supervisor-output |
| tasks/scripts/docker-build-image.sh:77,82,164 | ensure_prebuilt_binaries; host-staging gate + call to stage-prebuilt-binaries.sh |
| tasks/scripts/stage-prebuilt-binaries.sh:110-114 | Supervisor resolves to openshell-sandbox, musl libc |
| tasks/scripts/stage-prebuilt-binaries.sh:174-175,208 | Selects cargo zigbuild for musl; the unguarded host invocation |
| deploy/docker/Dockerfile.supervisor:22,33 | FROM scratch; bare COPY of prebuilt binary |
| tasks/scripts/vm/build-supervisor-bundle.sh:61-116,125 | ensure_build_nofile_limit() definition + call: the guard to generalize |
| architecture/build.md:58-60 | Documents that Dockerfiles don't compile Rust (contradicts the code comment) |
| mise.toml:59 | Global RUSTC_WRAPPER = "sccache" (holds extra fds; compounds EMFILE) |

Current Behavior

On macOS with the default ulimit -n 256, mise run gateway (docker or podman driver) reaches the host cargo zigbuild for *-unknown-linux-musl and the link fails:

error: unable to search for static library …/libwant-*.rlib: ProcessFdQuotaExceeded
error: could not compile `openshell-sandbox` (bin "openshell-sandbox")

Reproduced deterministically: ulimit -n 256 → fails on libwant; ulimit -n 8192 (same objects, re-link only) → succeeds. The VM path (gateway:vm) does not fail because it raises the limit first.
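The repro works because a soft limit set in the shell is inherited by every process that shell spawns. A minimal sketch of that inheritance, using a plain bash child as a stand-in for cargo/zig (the function name is hypothetical, not something in the repo):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the soft open-file limit seen by a child process
# after the parent shell sets its own soft limit to the given value.
child_nofile_limit() {
  local limit="$1"
  (
    # The subshell keeps the change from leaking into the caller's shell.
    ulimit -Sn "$limit" || exit 1
    # A newly spawned process inherits the parent's resource limits.
    bash -c 'ulimit -Sn'
  )
}

# Usage: child_nofile_limit 256   # prints the limit the child actually sees
```

Running the build inside such a subshell is a convenient way to reproduce the failure without permanently lowering the limit of an interactive shell.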

Complete unguarded call-site inventory

Every host-side musl/zigbuild cross-compile path and whether it has an fd guard:

| Entry point | Reaches host zigbuild via | Guard? |
| --- | --- | --- |
| mise run gateway (docker) | gateway-docker.sh:157 → docker-build-image.sh:82 → stage-prebuilt-binaries.sh:175 | NO (reported bug) |
| mise run gateway (podman, first run) | gateway.sh:180 → docker-build-image.sh:82 → stage-prebuilt-binaries.sh:175 | NO |
| mise run build:docker:supervisor | docker.toml:31 → docker-build-image.sh:82 → stage-prebuilt-binaries.sh:175 | NO |
| mise run build:docker:gateway | docker.toml:26 → docker-build-image.sh:82 → stage-prebuilt-binaries.sh:168 (gateway, zigbuild GNU 2.28) | NO |
| mise run stage-prebuilt (… all) | direct → stage-prebuilt-binaries.sh:168/175 | NO |
| docker-publish-multiarch.sh:47,51,99 | docker-build-image.sh:82 → stage-prebuilt-binaries.sh | NO |
| mise run gateway:vm / mise run vm:supervisor | vm/build-supervisor-bundle.sh:137 | YES (:125) |

Not affected: deploy/docker/cross-build.sh (runs inside a Linux build container), CI workflows (Linux runners), and Dockerfile.*-macos (native aarch64-apple-darwin, not cross/musl).

What Would Need to Change

  1. New shared helper tasks/scripts/build-env.sh exporting ensure_build_nofile_limit(), lifted from vm/build-supervisor-bundle.sh. Preserve both early-returns verbatim: [ "$(uname -s)" = "Darwin" ] || return 0 and command -v cargo-zigbuild >/dev/null 2>&1 || return 0. Keep the hard-limit clamp logic (macOS hard limit can be unlimited).
  2. tasks/scripts/stage-prebuilt-binaries.sh — source build-env.sh and call ensure_build_nofile_limit once near the top of main (before the per-arch loop). This single chokepoint fixes docker + podman + all docker:*/multiarch paths at once.
  3. tasks/scripts/vm/build-supervisor-bundle.sh — replace its inlined copy with the shared helper (de-dup).
  4. tasks/scripts/gateway-docker.sh:114-117 — fix the stale comment to state the cross-compile happens on the host and the limit is raised automatically.
  5. Optional: generalize the env var to OPENSHELL_BUILD_NOFILE_LIMIT (default 8192), honoring the existing OPENSHELL_VM_BUILD_NOFILE_LIMIT for back-compat.
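The shape of the shared helper in steps 1 and 2 might look roughly like the sketch below. It is not the in-tree implementation from vm/build-supervisor-bundle.sh: the two early returns are copied from the text above, while clamp_nofile_target, the numeric validation, and the warning message are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a shared fd-limit guard. NOT the in-tree implementation; the clamp
# helper, the validation, and the warning text are illustrative assumptions.

# Echo the soft limit to request: the desired value, lowered to the hard limit
# when the hard limit is finite and smaller.
clamp_nofile_target() {
  local desired="$1" hard="$2"
  if [ "$hard" != "unlimited" ] && [ "$hard" -lt "$desired" ]; then
    printf '%s\n' "$hard"
  else
    printf '%s\n' "$desired"
  fi
}

ensure_build_nofile_limit() {
  # Early returns quoted in the issue: only macOS hosts with cargo-zigbuild
  # need the raise, so Linux and CI fall through untouched.
  [ "$(uname -s)" = "Darwin" ] || return 0
  command -v cargo-zigbuild >/dev/null 2>&1 || return 0

  local desired="${OPENSHELL_VM_BUILD_NOFILE_LIMIT:-8192}"
  # Fall back to the default if the override is not a plain integer.
  case "$desired" in ''|*[!0-9]*) desired=8192 ;; esac

  local soft hard target
  soft="$(ulimit -Sn)"
  hard="$(ulimit -Hn)"

  # Already high enough: leave the shell's limit alone.
  if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$desired" ]; then
    return 0
  fi

  target="$(clamp_nofile_target "$desired" "$hard")"
  if ! ulimit -Sn "$target" 2>/dev/null; then
    echo "warning: could not raise open-file limit to $target (currently $soft)" >&2
  fi
}
```

Because the function calls ulimit directly rather than in a subshell, it has to be sourced and invoked in the same shell that later runs cargo zigbuild; calling it via a separate script process would raise the limit only for that process.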

Alternative Approaches Considered

  • Put the helper in tasks/scripts/container-engine.sh (already sourced by docker-build-image.sh). Rejected: that lib runs container-engine detection and errors if no engine is installed; the VM path must not depend on it. The fd guard is orthogonal to engine selection.
  • Guard in docker-build-image.sh instead of stage-prebuilt-binaries.sh. Weaker: misses mise run stage-prebuilt and any direct staging invocation. stage-prebuilt-binaries.sh is the true chokepoint.
  • Document-only ("run ulimit -n 8192"). Rejected: pushes a papercut onto every macOS contributor on the default path; the VM path already proves auto-raising is the right call.

Patterns to Follow

ensure_build_nofile_limit() in tasks/scripts/vm/build-supervisor-bundle.sh:61-116 is the exact, battle-tested pattern — including the hard-limit clamp and the sccache-EMFILE retry (:157-167, greps Too many open files|os error 24 and retries with env -u RUSTC_WRAPPER). ulimit -n set in the shell propagates to the cargo/zig children the scripts spawn (confirmed: cargo runs as a child of the same shell via mise x --, not a fresh login shell).
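For readers unfamiliar with the retry, the idea can be sketched as follows. This is an illustration of the pattern described above (match the two error strings, then rerun with RUSTC_WRAPPER unset), not a copy of lines 157-167; the function name and the log handling are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of an sccache-EMFILE retry; the function name is hypothetical.
# Runs the given command, and if it fails with an fd-exhaustion message,
# runs it once more with RUSTC_WRAPPER removed from the environment.
run_with_sccache_retry() {
  local log status
  log="$(mktemp)"

  # Stream output to the terminal while keeping a copy to inspect.
  "$@" 2>&1 | tee "$log"
  status="${PIPESTATUS[0]}"

  if [ "$status" -ne 0 ] && grep -Eq 'Too many open files|os error 24' "$log"; then
    echo "fd exhaustion detected; retrying without RUSTC_WRAPPER" >&2
    env -u RUSTC_WRAPPER "$@"
    status=$?
  fi

  rm -f "$log"
  return "$status"
}

# Usage (the cargo arguments are placeholders):
#   run_with_sccache_retry cargo zigbuild --release --target <triple>
```

Since the retry goes through env, the wrapped command must be an executable on PATH rather than a shell function. Failures that do not match either pattern are returned unchanged and are not retried.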

Proposed Approach

Extract the existing VM fd-limit guard into a small shared tasks/scripts/build-env.sh, then call it from the host-staging chokepoint (stage-prebuilt-binaries.sh) so all docker/podman/multiarch host cross-compiles raise ulimit -n before cargo zigbuild, exactly as the VM path already does. De-duplicate the VM script to use the shared helper, and correct the misleading in-container comment in gateway-docker.sh. The helper stays a no-op on Linux and when zigbuild is absent, so CI and Linux dev are unaffected. The raised limit is the primary fix; porting the VM path's sccache-EMFILE retry into the stage path is a possible follow-up if EMFILE recurs under sccache pressure.

Scope Assessment

  • Complexity: Low
  • Confidence: High — every step traced to file:line; the fix reuses an in-tree, proven helper; CI/Linux safety is guaranteed by two independent gates (uname early-return + the CI gate at docker-build-image.sh:77).
  • Estimated files to change: 3 core (tasks/scripts/build-env.sh new, stage-prebuilt-binaries.sh, vm/build-supervisor-bundle.sh) + up to 3 optional (gateway-docker.sh comment, a test, architecture/build.md line).
  • Issue type: fix

Risks & Open Questions

  • Env var naming (decision): introduce OPENSHELL_BUILD_NOFILE_LIMIT vs. reuse the VM-specific OPENSHELL_VM_BUILD_NOFILE_LIMIT. Recommend the general name with back-compat for the old one.
  • sccache interaction: mise.toml:59 sets RUSTC_WRAPPER=sccache globally, which holds extra fds. The stage path lacks the VM path's sccache-EMFILE retry. Raising the limit should resolve it; decide whether to also port the retry now or as follow-up.
  • Helper must be safe to call unconditionally: stage-prebuilt-binaries.sh is not itself CI-gated (only its caller is), so the uname/zigbuild early-returns must live in the helper, not rely on CI gating.
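If the general name is adopted, the back-compat lookup could be sketched as below. OPENSHELL_BUILD_NOFILE_LIMIT is the proposed (not yet existing) variable, the function name is hypothetical, and treating "min 1024" as a floor that low values are raised to is an assumption about the intended semantics.

```shell
#!/usr/bin/env bash
# Sketch of env var resolution with back-compat; the function name is hypothetical.
# Precedence: proposed general variable, then the existing VM-specific one,
# then the default of 8192.
resolve_build_nofile_limit() {
  local value="${OPENSHELL_BUILD_NOFILE_LIMIT:-${OPENSHELL_VM_BUILD_NOFILE_LIMIT:-8192}}"

  # Ignore anything that is not a plain non-negative integer.
  case "$value" in
    ''|*[!0-9]*) value=8192 ;;
  esac

  # Assumed semantics for "min 1024": raise too-small requests to the floor.
  if [ "$value" -lt 1024 ]; then
    value=1024
  fi

  printf '%s\n' "$value"
}

# Usage: limit="$(resolve_build_nofile_limit)"
```

Keeping the resolution in one small function means the VM script and the staging script cannot drift apart on precedence or defaults.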

Test Considerations

  • No bash unit/bats harness or shellcheck job currently exists for tasks/scripts/ (only tasks/scripts/test-install-sh.sh, install.sh-specific).
  • Recommended low-cost test: source build-env.sh, set ulimit -n 256 in a subshell, call ensure_build_nofile_limit, assert the limit rose on macOS / is a no-op on Linux (gate the assertion on uname).
  • Definitive manual verification (already reproduced): ulimit -n 256; mise run gateway (docker driver) fails before the fix; succeeds after. Include this in the PR's testing section.
  • Consider a separate chore to add a shellcheck lint over tasks/scripts/*.sh.
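The recommended low-cost test could be sketched along these lines. It assumes the helper file defines ensure_build_nofile_limit once sourced (build-env.sh does not exist yet), and check_nofile_guard is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch of the low-cost guard test; check_nofile_guard is a hypothetical name.
# Usage: check_nofile_guard path/to/build-env.sh
check_nofile_guard() {
  local helper="$1"
  (
    # Everything runs in a subshell so the lowered limit never leaks out.
    . "$helper" || exit 2
    ulimit -Sn 256 || exit 2
    ensure_build_nofile_limit || exit 1

    after="$(ulimit -Sn)"
    if [ "$(uname -s)" = "Darwin" ] && command -v cargo-zigbuild >/dev/null 2>&1; then
      # macOS with zigbuild installed: the guard should have raised the limit.
      [ "$after" = "unlimited" ] || [ "$after" -gt 256 ]
    else
      # Everywhere else the guard must be a no-op.
      [ "$after" = "256" ]
    fi
  )
}
```

The macOS branch also requires cargo-zigbuild, mirroring the helper's second early return; without that condition the test would fail on a Mac that simply has no zigbuild installed.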

Created by spike investigation. Use build-from-issue to plan and implement.
