
Add cloud-hypervisor VMM backend for Linux hosts #782

Open: crosbymichael wants to merge 1 commit into apple:main from crosbymichael:cloud-hypervisor

apple/containerization currently runs containers in per-container VMs
on macOS hosts via Virtualization.framework. This adds a second VMM
backend so the same Swift orchestration layer (LinuxContainer /
LinuxPod / Vminitd gRPC contract) runs on Linux hosts via
cloud-hypervisor + KVM.

**CloudHypervisor Swift package** (`Sources/CloudHypervisor/`) — a thin
client for cloud-hypervisor's REST-over-UDS API, layered on
AsyncHTTPClient. Endpoints cover VMM / VM lifecycle / hotplug
(disk, fs, net, vsock, remove-device). The package is cross-platform:
it compiles on macOS so unit tests run there, but only the Linux side
of Containerization consumes it at runtime.
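To make the wire format concrete, here is a minimal sketch of what such a client ends up writing to the API socket. It deliberately skips AsyncHTTPClient and just serializes a `PUT` request as raw HTTP/1.1 text. The `/api/v1/vm.add-disk` and `/api/v1/vm.remove-device` paths are cloud-hypervisor endpoints. The request bodies use only a small subset of its disk and remove-device schemas. `DiskConfig`, `RemoveDeviceRequest`, and `makeAPIRequest` are hypothetical names, not the package's actual API.

```swift
import Foundation

// Hypothetical request bodies. Field names follow a small subset of
// cloud-hypervisor's DiskConfig and remove-device JSON schemas.
struct DiskConfig: Codable {
    var path: String
    var readonly: Bool
    var id: String?          // omitted from the JSON when nil
}

struct RemoveDeviceRequest: Codable {
    var id: String
}

/// Serializes a PUT to `/api/v1/<endpoint>` as raw HTTP/1.1 text, i.e. the
/// bytes a client would write to cloud-hypervisor's API Unix socket.
func makeAPIRequest<Body: Encodable>(endpoint: String, body: Body) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]  // deterministic key order
    let json = try encoder.encode(body)
    // HTTP/1.1 still expects a Host header, even over a UDS.
    let head = [
        "PUT /api/v1/\(endpoint) HTTP/1.1",
        "Host: localhost",
        "Content-Type: application/json",
        "Content-Length: \(json.count)",
    ]
    return head.joined(separator: "\r\n") + "\r\n\r\n" + String(decoding: json, as: UTF8.self)
}

// Usage: hotplug a disk, then remove it again by the same id.
let addDisk = try makeAPIRequest(
    endpoint: "vm.add-disk",
    body: DiskConfig(path: "/tmp/example.ext4", readonly: false, id: "disk1"))
let removeDisk = try makeAPIRequest(
    endpoint: "vm.remove-device",
    body: RemoveDeviceRequest(id: "disk1"))
print(addDisk)
print(removeDisk)
```

Giving each hotplugged device an explicit `id` is what lets a later `remove-device` call target it without looking it up first.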

**CH backend in Containerization** — one cloud-hypervisor subprocess
per VM, gated behind `#if os(Linux)`. CHVirtualMachineManager /
CHVirtualMachineInstance mirror the VZ shape behind the existing
VirtualMachineManager / VirtualMachineInstance protocols. CHProcess
and VirtiofsdProcess manage the binaries; CHHotplugProvider handles
virtio-blk and virtio-fs runtime hotplug (with one virtiofsd per
unique source-hash tag, refcounted across containers).
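The "one virtiofsd per unique source-hash tag, refcounted across containers" bookkeeping can be sketched as below. This is an illustration under assumptions. `VirtiofsdRegistry` and `sourceTag` are hypothetical. The PR does not say which hash it uses, so 64-bit FNV-1a stands in here because it is stable across processes, unlike Swift's per-process-seeded `Hasher`.

```swift
import Foundation

/// Hypothetical bookkeeping for shared virtiofsd daemons: one daemon per
/// unique tag, shared by every container that mounts the same source.
final class VirtiofsdRegistry {
    private var refcounts: [String: Int] = [:]
    private let lock = NSLock()

    /// Returns true when the caller is the first user of `tag` and must
    /// start a new virtiofsd. Later callers share the running one.
    func acquire(tag: String) -> Bool {
        lock.lock(); defer { lock.unlock() }
        let count = refcounts[tag, default: 0]
        refcounts[tag] = count + 1
        return count == 0
    }

    /// Returns true when the last user of `tag` is gone and the daemon
    /// should be stopped. Releasing an unknown tag is a no-op.
    func release(tag: String) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard let count = refcounts[tag] else { return false }
        if count <= 1 {
            refcounts[tag] = nil
            return true
        }
        refcounts[tag] = count - 1
        return false
    }
}

/// Hypothetical tag derivation: 64-bit FNV-1a over the source path, so the
/// same source always maps to the same tag in every process.
func sourceTag(for path: String) -> String {
    var hash: UInt64 = 0xcbf29ce484222325          // FNV offset basis
    for byte in path.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3               // FNV prime, wrapping multiply
    }
    return "src-" + String(hash, radix: 16)
}

// Usage: two containers mounting the same source share one daemon.
let registry = VirtiofsdRegistry()
let tag = sourceTag(for: "/var/lib/data")
print(registry.acquire(tag: tag))  // true: start virtiofsd
print(registry.acquire(tag: tag))  // false: reuse the running daemon
print(registry.release(tag: tag))  // false: still in use
print(registry.release(tag: tag))  // true: stop virtiofsd
```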

**Linux host networking** — BridgeManager brings up a Linux bridge
with an IPv4 subnet and (opt-in via `--enable-nat`) iptables
MASQUERADE + scoped FORWARD rules. LinuxBridgedNetwork enslaves a
fresh TAP per container to the bridge. State is recorded under
`/run/containerization` so `cctl bridge delete` reverses exactly
what `cctl bridge create` did. Bridge teardown checks the link kind
via sysfs and refuses to delete non-bridge interfaces.
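A minimal sketch of the "record what create did, reverse it on delete" idea and of the sysfs guard follows. The PR does not list its exact iptables rules. The three below are an illustrative assumption: a subnet-scoped `MASQUERADE` plus two `FORWARD` accepts. The sysfs check relies on Linux bridge devices exposing a `bridge/` directory under `/sys/class/net/<name>/`. `IptablesRule`, `natRules`, `teardownArgs`, and `isBridge` are hypothetical names.

```swift
import Foundation

/// Hypothetical record of one iptables rule appended during bridge create.
/// Codable so the list could be persisted as state and read back at delete.
struct IptablesRule: Codable, Equatable {
    var table: String      // e.g. "nat" or "filter"
    var chain: String      // e.g. "POSTROUTING" or "FORWARD"
    var spec: [String]     // match and target arguments

    /// Arguments that append the rule.
    var appendArgs: [String] { ["-t", table, "-A", chain] + spec }
    /// Arguments that delete exactly the same rule.
    var deleteArgs: [String] { ["-t", table, "-D", chain] + spec }
}

/// Illustrative NAT and FORWARD rules for bridge `br` carrying `subnet`.
func natRules(bridge br: String, subnet: String) -> [IptablesRule] {
    [
        IptablesRule(table: "nat", chain: "POSTROUTING",
                     spec: ["-s", subnet, "!", "-o", br, "-j", "MASQUERADE"]),
        IptablesRule(table: "filter", chain: "FORWARD",
                     spec: ["-i", br, "-j", "ACCEPT"]),
        IptablesRule(table: "filter", chain: "FORWARD",
                     spec: ["-o", br, "-m", "conntrack", "--ctstate", "RELATED,ESTABLISHED", "-j", "ACCEPT"]),
    ]
}

/// Teardown replays the inverse of each recorded rule, newest first.
func teardownArgs(for recorded: [IptablesRule]) -> [[String]] {
    recorded.reversed().map(\.deleteArgs)
}

/// A Linux bridge has a `bridge/` directory under /sys/class/net/<name>.
/// Refuse to delete any interface that lacks it.
func isBridge(_ name: String, sysfsRoot: String = "/sys/class/net") -> Bool {
    var isDir: ObjCBool = false
    let path = "\(sysfsRoot)/\(name)/bridge"
    return FileManager.default.fileExists(atPath: path, isDirectory: &isDir) && isDir.boolValue
}

// Usage: the commands a delete would run for rules recorded at create time.
let recorded = natRules(bridge: "br0", subnet: "10.0.0.0/24")
for args in teardownArgs(for: recorded) {
    print((["iptables"] + args).joined(separator: " "))
}
```

Deriving the delete from the same stored `spec` (instead of re-deriving rules from current flags) is what keeps teardown exact even if defaults change between create and delete.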

**cctl run / bridge** — end-to-end Linux container run path (image
pull, ext4 rootfs assembly, VM boot, container exec) plus
`cctl bridge create|delete` for the host network plumbing.

**Build & dist** — `make linux-build` / `make linux-integration`
build and exercise the host side inside an apple/container
`--virtualization` dev container. `make dist-x86_64` produces a
deployment tarball (cctl + cloud-hypervisor + virtiofsd + initfs +
kernel) cross-compiled from the aarch64 dev container; pipeline
documented in `docs/x86_64-build.md`. Static-musl C deps and the
Zig cross compiler are pinned by SHA256.

The host orchestrator runs as root. Per-VM runtime state lives under
`/run/containerization/ch/<UUID>` with mode 0700; UDS sockets inside
are bound with mode 0600. Vminitd's gRPC channel inherits that trust
boundary — socket-file perms are the auth.
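The permission layout above can be sketched with Foundation's `FileManager`, assuming that is how the modes get applied. The PR may use raw syscalls instead. `makeVMStateDirectory`, `restrictToOwner`, and `permissions(of:)` are hypothetical helpers. For a real socket, `restrictToOwner` would run right after `bind(2)`.

```swift
import Foundation

/// Hypothetical helper: create `<root>/<UUID>` with mode 0700 so only the
/// owning user (root, for the orchestrator) can enter it.
func makeVMStateDirectory(root: String, id: UUID = UUID()) throws -> String {
    let dir = "\(root)/\(id.uuidString)"
    try FileManager.default.createDirectory(
        atPath: dir,
        withIntermediateDirectories: true,
        attributes: [.posixPermissions: NSNumber(value: 0o700)])
    return dir
}

/// Hypothetical helper: tighten a path (e.g. a freshly bound UDS) to 0600.
/// Explicit permission changes like this are not masked by the umask.
func restrictToOwner(_ path: String) throws {
    try FileManager.default.setAttributes(
        [.posixPermissions: NSNumber(value: 0o600)], ofItemAtPath: path)
}

/// Reads the permission bits back, for verification.
func permissions(of path: String) throws -> Int? {
    let value = try FileManager.default.attributesOfItem(atPath: path)[.posixPermissions]
    if let number = value as? NSNumber { return number.intValue }
    return value as? Int
}

// Usage, under a temporary root instead of /run/containerization/ch:
let root = NSTemporaryDirectory() + "ch-demo-" + UUID().uuidString
let vmDir = try makeVMStateDirectory(root: root)
let mode = try permissions(of: vmDir) ?? 0
print(String(mode, radix: 8))
```

The 0700 parent directory is what closes the short window between `bind(2)` and the chmod: other non-root users cannot traverse into the directory to reach the socket while it still has default permissions.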

Sandbox flags are upstream-secure by default. Two per-component
opt-outs exist for the apple/container dev-container case (where the
host seccomp profile SIGSYS-kills CH and virtiofsd):
- `CONTAINERIZATION_NO_CH_SECCOMP=1`: run `cloud-hypervisor --seccomp false`.
- `CONTAINERIZATION_NO_VIRTIOFSD_SANDBOX=1`: run `virtiofsd --sandbox none`.

Each opt-out emits a one-shot `logger.warning` at process start. The
legacy alias `CONTAINERIZATION_RELAXED_SANDBOX=1` flips both. cctl spawns
both binaries with `setsid` and a minimal env allowlist
(PATH / HOME / RUST_LOG / RUST_BACKTRACE) so the parent's secrets
don't leak to children.
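The opt-out logic and the environment allowlist described above reduce to a small pure mapping. Here is a sketch of it. It assumes a variable counts as set only when its value is exactly `"1"`. `SandboxOptOuts` and `childEnvironment` are hypothetical names. The one-shot warning and the `setsid` spawn itself are omitted.

```swift
import Foundation

/// Hypothetical mapping from the documented opt-out variables to extra
/// command-line arguments. Assumes only the exact value "1" enables one.
struct SandboxOptOuts {
    var noCHSeccomp: Bool
    var noVirtiofsdSandbox: Bool

    init(environment env: [String: String]) {
        // Legacy alias: flips both opt-outs at once.
        let relaxed = env["CONTAINERIZATION_RELAXED_SANDBOX"] == "1"
        noCHSeccomp = relaxed || env["CONTAINERIZATION_NO_CH_SECCOMP"] == "1"
        noVirtiofsdSandbox = relaxed || env["CONTAINERIZATION_NO_VIRTIOFSD_SANDBOX"] == "1"
    }

    /// Secure by default: no extra arguments unless an opt-out is set.
    var cloudHypervisorArgs: [String] { noCHSeccomp ? ["--seccomp", "false"] : [] }
    var virtiofsdArgs: [String] { noVirtiofsdSandbox ? ["--sandbox", "none"] : [] }
}

/// Only these variables are forwarded to the spawned binaries.
let environmentAllowlist: Set<String> = ["PATH", "HOME", "RUST_LOG", "RUST_BACKTRACE"]

/// Drops everything not on the allowlist, so parent secrets never reach children.
func childEnvironment(from parent: [String: String]) -> [String: String] {
    parent.filter { environmentAllowlist.contains($0.key) }
}

// Usage:
let parent = ["PATH": "/usr/bin", "AWS_SECRET_ACCESS_KEY": "example",
              "CONTAINERIZATION_RELAXED_SANDBOX": "1"]
let options = SandboxOptOuts(environment: parent)
print(options.cloudHypervisorArgs)     // ["--seccomp", "false"]
print(options.virtiofsdArgs)           // ["--sandbox", "none"]
print(childEnvironment(from: parent))  // ["PATH": "/usr/bin"]
```

An allowlist (rather than a denylist of known secret names) fails closed: a newly introduced credential variable is dropped without anyone having to remember to add it.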

`make linux-integration` runs the cross-platform integration suite
against a real cloud-hypervisor VM inside the dev container. Linux
runs the cross-platform subset (`process true`/`false`/`echo hi`,
virtiofs round-trip, hotplug); the macOS suite is unchanged.

Signed-off-by: michael_crosby <michael_crosby@apple.com>