Complete documentation for madengine, an AI model automation and distributed benchmarking platform.
| Guide | Description |
|---|---|
| Installation | Complete installation instructions |
| Usage Guide | Commands, configuration, and examples (`--skip-model-run`) |
| Configuration | Advanced configuration options (includes run log error pattern scan) |
| Batch Build | Selective builds with batch manifests |
| Deployment | Kubernetes and SLURM deployment |
| Launchers | Multi-node training frameworks |
| Profiling | Performance analysis tools |
| Contributing | How to contribute to madengine |
| CLI Reference | Complete command-line options and examples |
The architecture diagram (Orchestration, Infrastructure, and Launcher layers) is in the main README. Summary:
- CLI Layer - User interface with 5 commands (discover, build, run, report, database)
- Model Discovery - Find and validate models from MAD package
- Orchestration - BuildOrchestrator & RunOrchestrator manage workflows
- Execution Targets - Local Docker, Kubernetes Jobs, or SLURM Jobs
- Distributed Launchers - Training (torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus) and Inference (vLLM, SGLang)
- Performance Output - CSV/JSON results with metrics
- Post-Processing - Report generation (HTML/Email) and database upload (MongoDB)
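The performance output described above (CSV/JSON results with metrics) can be post-processed with standard tooling. A minimal sketch in Python, assuming a hypothetical CSV layout (the column names `model`, `status`, `performance`, and `metric` are illustrative, not madengine's actual schema):

```python
import csv
import io

# Hypothetical results CSV for illustration only; the real madengine
# output columns may differ.
sample = """model,status,performance,metric
llama2-7b,SUCCESS,1234.5,tokens/sec
resnet50,SUCCESS,987.0,images/sec
"""

def summarize(csv_text):
    """Collect performance numbers for successful runs, keyed by model name."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        row["model"]: float(row["performance"])
        for row in rows
        if row["status"] == "SUCCESS"
    }

print(summarize(sample))
```

A summary like this could feed a custom dashboard alongside the built-in HTML/email reports and MongoDB upload.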
- Main Repository: https://github.com/ROCm/madengine
- MAD Package: https://github.com/ROCm/MAD
- Issues: https://github.com/ROCm/madengine/issues
- ROCm Documentation: https://rocm.docs.amd.com/
- Run a model locally → Installation → Usage Guide
- Deploy to Kubernetes → Configuration → Deployment
- Deploy to SLURM → Configuration → Deployment
- Build multiple models selectively (CI/CD) → Batch Build
- Profile model performance → Profiling
- Multi-node distributed training → Launchers → Deployment
- Contribute to madengine → Contributing
madengine operates within the MAD (Model Automation and Dashboarding) ecosystem. The MAD package contains:
- Model definitions (`models.json`)
- Execution scripts (`run.sh`)
- Docker configurations
- Data provider configurations (`data.json`)
- Credentials (`credential.json`)
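To make the model-definition idea concrete, here is an illustrative sketch of what an entry in `models.json` might look like. The field names (`name`, `dockerfile`, `scripts`, `tags`) are assumptions for illustration; consult the MAD repository for the actual schema:

```json
{
  "name": "dummy_model",
  "dockerfile": "docker/dummy.Dockerfile",
  "scripts": "scripts/dummy/run.sh",
  "tags": ["example"]
}
```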
madengine - Modern CLI with:
- Rich terminal output
- Distributed deployment support (K8s, SLURM)
- Build/run separation
- Manifest-based execution
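Manifest-based execution means a build step can emit a manifest that a later run step consumes, which is what enables the build/run separation listed above. A hypothetical manifest fragment, purely illustrative (field names are assumptions, not madengine's actual format):

```json
{
  "models": ["dummy_model"],
  "image": "registry.example.com/madengine/dummy_model:latest",
  "built_at": "2024-01-01T00:00:00Z"
}
```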
- Local - Docker containers on local machine
- Kubernetes - Cloud-native container orchestration
- SLURM - HPC cluster job scheduling
- torchrun - PyTorch DDP/FSDP
- deepspeed - ZeRO optimization
- megatron - Large transformers (K8s + SLURM)
- torchtitan - LLM pre-training
- vllm - LLM inference
- sglang - Structured generation
This documentation follows these principles:
- Task-oriented - Organized by what users want to accomplish
- Progressive disclosure - Start simple, add complexity as needed
- Examples first - Show working examples before explaining details
- Consistent naming - Files follow a simple naming pattern (no prefixes)
- Up-to-date - Reflects current implementation (v2.0)
Documentation improvements are welcome! Please:
- Keep examples working and tested
- Use consistent formatting and style
- Update cross-references when moving content
- Mark deprecated content clearly
- Follow the existing structure
See Contributing Guide for details.
madengine is licensed under the MIT License. See LICENSE for details.