Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
madengine supports two deployment backends:
- Kubernetes - Cloud-native container orchestration
- SLURM - HPC cluster job scheduling
Deployment is configured via --additional-context and happens automatically during the run phase.
```
┌─────────────────────────────────────────────┐
│ 1. Build Phase (Local or CI/CD)             │
│    madengine build --tags model             │
│    → Creates Docker image                   │
│    → Pushes to registry                     │
│    → Generates build_manifest.json          │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 2. Deploy Phase (Run with Context)          │
│    madengine run                            │
│      --manifest-file build_manifest.json    │
│      --additional-context '{"deploy":...}'  │
│    → Detects deployment target              │
│    → Creates K8s Job or SLURM script        │
│    → Submits and monitors execution         │
└─────────────────────────────────────────────┘
```
- Kubernetes cluster with GPU support
- GPU device plugin installed (AMD or NVIDIA)
- Kubeconfig configured (`~/.kube/config` or in-cluster)
- Docker registry accessible from cluster
```json
{
  "k8s": {
    "gpu_count": 1
  }
}
```

This automatically applies intelligent defaults for namespace, resources, image pull policy, etc.
```bash
# 1. Build image
madengine build --tags my_model \
  --registry my-registry.io \
  --additional-context-file k8s-config.json

# 2. Deploy to Kubernetes
madengine run \
  --manifest-file build_manifest.json \
  --timeout 3600
```

The deployment target is automatically detected from the `k8s` key in the config.
k8s-config.json:
```json
{
  "k8s": {
    "gpu_count": 2,
    "namespace": "ml-team",
    "gpu_vendor": "AMD",
    "memory": "32Gi",
    "cpu": "16",
    "service_account": "madengine-sa",
    "image_pull_policy": "Always"
  }
}
```

Configuration Priority:
1. User config (`--additional-context-file`)
2. Profile presets (single-gpu/multi-gpu)
3. GPU vendor presets (AMD/NVIDIA)
4. Base defaults
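This layering behaves like a recursive dictionary merge applied lowest priority first, so higher-priority layers overwrite conflicting keys. A sketch under the assumption that every layer is a plain JSON object (the preset contents shown are made up for illustration):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers, lowest priority first.
base_defaults = {"k8s": {"namespace": "default", "image_pull_policy": "IfNotPresent"}}
vendor_preset = {"k8s": {"gpu_vendor": "AMD"}}
user_config = {"k8s": {"namespace": "ml-team", "gpu_count": 2}}

effective = deep_merge(deep_merge(base_defaults, vendor_preset), user_config)
print(effective["k8s"]["namespace"])  # ml-team (user config wins)
```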
See examples/k8s-configs/ for complete examples.
By default (`k8s.secrets.strategy: from_local_credentials`), `madengine run` creates Kubernetes Secrets from a local `credential.json` when present: Docker Hub pull credentials (when configured) and an opaque Secret for runtime use. Credentials are not embedded in the ConfigMap in that case. For GitOps workflows or clusters without client-side files, use `existing` or omit the strategy, and set `k8s.secrets.image_pull_secret_names` / `k8s.secrets.runtime_secret_name` as needed. See Configuration and `examples/k8s-configs/README.md`.
With `"debug": true` in additional context, `madengine run` writes rendered manifests under `./k8s_manifests` (or the path you configure). To lint those YAML files against the Kubernetes OpenAPI schema, install kubeconform and run from the repository root:

```bash
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifests
```

The script exits successfully if kubeconform is missing (skip) or if validation passes.
For distributed training across multiple nodes:
```json
{
  "k8s": {
    "gpu_count": 8
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}
```

This creates:
- Kubernetes Indexed Job with 2 completions
- Headless service for pod discovery
- Automatic rank assignment via `JOB_COMPLETION_INDEX`
- `MAD_MULTI_NODE_RUNNER` environment variable with the torchrun command
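In an Indexed Job, Kubernetes injects each pod's completion index as `JOB_COMPLETION_INDEX`, which can serve directly as the torchrun node rank. A hypothetical sketch of assembling the launcher command (the script name, port, and service name are placeholders, not madengine's actual values):

```python
import os

def build_torchrun_cmd(nnodes: int, nproc_per_node: int, master_addr: str) -> str:
    # Kubernetes injects JOB_COMPLETION_INDEX into each pod of an Indexed Job.
    node_rank = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    return (
        f"torchrun --nnodes={nnodes} --nproc_per_node={nproc_per_node} "
        f"--node_rank={node_rank} --rdzv_endpoint={master_addr}:29500 train.py"
    )

# Simulate pod 1 of the Indexed Job.
os.environ["JOB_COMPLETION_INDEX"] = "1"
print(build_torchrun_cmd(2, 4, "madengine-job-0.headless-svc"))
```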
Supported Launchers:
- `torchrun` - PyTorch DDP/FSDP
- `deepspeed` - ZeRO optimization
- `megatron` - Megatron-LM training
- `torchtitan` - LLM pre-training
- `primus` - Primus unified pretrain (Megatron / TorchTitan / MaxText YAML)
- `vllm` - LLM inference
- `sglang` - Structured generation
- `sglang-disagg` - Disaggregated SGLang (multi-node)
See Launchers Guide for details.
```bash
# Check job status
kubectl get jobs -n your-namespace

# View pod logs
kubectl logs -f job/madengine-job-xxx -n your-namespace

# Check pod status
kubectl get pods -n your-namespace
```

Finished Jobs are not removed unless you set `k8s.ttl_seconds_after_finished` to a positive number of seconds; the Job manifest then includes `ttlSecondsAfterFinished` so the control plane can garbage-collect the Job after it finishes. The deploy step may still delete Secrets it created when cleaning up a failed or cancelled deploy; see runtime logs for details.
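For example, to have the control plane delete finished Jobs after one hour, a config fragment using the key described above (merged with the rest of your `k8s` settings):

```json
{
  "k8s": {
    "gpu_count": 1,
    "ttl_seconds_after_finished": 3600
  }
}
```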
Manual cleanup:
```bash
kubectl delete job madengine-job-xxx -n your-namespace
```

- Access to SLURM login node
- SLURM commands available (`sbatch`, `squeue`, `scontrol`)
- Shared filesystem for MAD package and results
- Module system or container runtime (Singularity/Apptainer)
slurm-config.json:
```json
{
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 4,
    "time": "02:00:00",
    "account": "my_account"
  }
}
```

```bash
# 1. Build image (on build node or locally)
madengine build --tags my_model \
  --registry my-registry.io \
  --additional-context-file slurm-config.json

# 2. SSH to SLURM login node
ssh [email protected]

# 3. Deploy to SLURM
cd /shared/workspace
madengine run \
  --manifest-file build_manifest.json \
  --timeout 7200
```

The deployment target is automatically detected from the `slurm` key in the config.
slurm-config.json:
```json
{
  "slurm": {
    "partition": "gpu",
    "account": "research_group",
    "qos": "normal",
    "gpus_per_node": 8,
    "nodes": 1,
    "time": "24:00:00",
    "mail_user": "[email protected]",
    "mail_type": "ALL"
  }
}
```

Common SLURM Options:
- `partition`: SLURM partition name
- `account`: Billing account
- `qos`: Quality of Service
- `gpus_per_node`: Number of GPUs per node
- `nodes`: Number of nodes (for multi-node)
- `nodelist`: Comma-separated node names to run on (e.g. `"node01,node02"`); when set, the job runs only on these nodes and the node health preflight is skipped
- `time`: Wall time limit (HH:MM:SS)
- `mem`: Memory per node (e.g., `"64G"`)
- `mail_user`: Email for job notifications
- `mail_type`: Notification types (BEGIN, END, FAIL, ALL)
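Conceptually, each option becomes an `#SBATCH` directive in the generated batch script. A simplified sketch of that mapping (this is not madengine's actual template; the key-to-option renames shown are just the usual sbatch spellings):

```python
def render_sbatch_header(slurm_cfg: dict) -> str:
    """Turn a slurm config dict into #SBATCH directives (illustrative only)."""
    # Rename config keys to sbatch option names where they differ.
    key_to_option = {
        "gpus_per_node": "gpus-per-node",
        "mail_user": "mail-user",
        "mail_type": "mail-type",
    }
    lines = ["#!/bin/bash"]
    for key, value in slurm_cfg.items():
        option = key_to_option.get(key, key)
        lines.append(f"#SBATCH --{option}={value}")
    return "\n".join(lines)

print(render_sbatch_header({"partition": "gpu", "gpus_per_node": 4, "time": "02:00:00"}))
```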
See examples/slurm-configs/ for complete examples.
For distributed training across SLURM nodes:
```json
{
  "slurm": {
    "partition": "gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "48:00:00"
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}
```

SLURM automatically provides:
- Node list via `$SLURM_JOB_NODELIST`
- Master address detection
- Network interface configuration
- Rank assignment via `$SLURM_PROCID`
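These environment variables can drive the launcher invocation directly. A sketch of deriving torchrun settings from standard SLURM variables (the fallback values are for illustration; note that real batch scripts usually expand compressed node lists like `node[01-04]` with `scontrol show hostnames` first):

```python
import os

def torchrun_args_from_slurm() -> dict:
    """Map standard SLURM environment variables to torchrun settings."""
    return {
        "nnodes": int(os.environ.get("SLURM_JOB_NUM_NODES", "1")),
        "node_rank": int(os.environ.get("SLURM_NODEID", "0")),
        # First hostname in the node list is a common choice for the master.
        "master_addr": os.environ.get("SLURM_JOB_NODELIST", "localhost").split(",")[0],
    }

# Simulate the environment SLURM would provide on node 2 of a 4-node job.
os.environ.update({"SLURM_JOB_NUM_NODES": "4", "SLURM_NODEID": "2",
                   "SLURM_JOB_NODELIST": "node01,node02,node03,node04"})
print(torchrun_args_from_slurm())
```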
```bash
# Check job queue
squeue -u $USER

# Monitor job progress
squeue -j <job_id>

# View job details
scontrol show job <job_id>

# Check output logs
tail -f slurm-<job_id>.out

# Cancel job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER
```

| Feature | Kubernetes | SLURM |
|---|---|---|
| Environment | Cloud, on-premise | HPC clusters |
| Orchestration | Automatic | Job scheduler |
| Dependencies | Python library (kubernetes) | CLI commands only |
| Multi-node Setup | Headless service + DNS | SLURM env vars |
| Resource Management | Declarative (YAML) | Batch script |
| Best For | Cloud deployments, microservices | Academic HPC, supercomputers |
Single-GPU development:
```json
{
  "k8s": {
    "gpu_count": 1,
    "namespace": "dev"
  }
}
```

Multi-GPU, single node:
```json
{
  "k8s": {
    "gpu_count": 4,
    "memory": "64Gi",
    "cpu": "32"
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 1,
    "nproc_per_node": 4
  }
}
```

Multi-node Kubernetes training:
```json
{
  "k8s": {
    "gpu_count": 8,
    "namespace": "ml-training"
  },
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}
```

Single-node SLURM job:
```json
{
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 8,
    "time": "12:00:00"
  }
}
```

Multi-node SLURM training:
```json
{
  "slurm": {
    "partition": "gpu",
    "nodes": 8,
    "gpus_per_node": 8,
    "time": "72:00:00",
    "account": "research_proj"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 8,
    "nproc_per_node": 8
  }
}
```

Image Pull Failures:
```bash
# Check image exists
docker pull <registry>/<image>:<tag>

# Verify image pull secrets
kubectl get secrets -n your-namespace

# Check pod events
kubectl describe pod <pod-name> -n your-namespace
```

Resource Issues:
```bash
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

# Check GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'amd\.com/gpu'
```

Job Pending:
```bash
# Check reason
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

# Check partition status
sinfo -p gpu
```

Out of Resources:
```bash
# Check available resources
sinfo -o "%P %.5a %.10l %.6D %.6t %N"

# Adjust resource requests in config
```

Kubernetes:
- Use minimal configs with intelligent defaults
- Specify resource limits to prevent over-allocation
- Use appropriate namespaces for isolation
- Configure image pull policies based on registry location
- Monitor pod resource usage with `kubectl top`
SLURM:
- Start with conservative time limits
- Use appropriate QoS for priority
- Monitor job efficiency with `seff <job_id>`
- Use shared filesystem for input/output
- Test with single node before scaling
- Launchers Guide - Distributed training and inference launchers
- K8s Examples - Complete Kubernetes configurations
- SLURM Examples - Complete SLURM configurations
- Usage Guide - General usage instructions