# XoRL

High-performance distributed training for LLMs — RL, SFT, MoE, and beyond.
🚀 Installation · ⚡ Quick Start · 📚 Documentation
XoRL is a distributed training framework designed for large language models with composable parallelism and flexible training modes.
Two training modes:

- Local — torchrun-based training for offline SFT and pretraining
- Server — REST API-driven training for online RL loops, where an external orchestrator (e.g. xorl_client) controls the training loop
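To make the server-mode control flow concrete, here is a minimal sketch of how an external orchestrator might assemble one training-step request. The endpoint path, payload keys, and port are assumptions for illustration only, not XoRL's actual REST API — see the server docs for the real interface.

```python
import json

# Assumed server address; XoRL's actual host/port configuration may differ.
BASE_URL = "http://localhost:8000"

def make_train_step_request(batch_ids, learning_rate):
    """Build the JSON body for one hypothetical training step.

    The /train_step path and the batch_ids/lr keys are illustrative
    assumptions, not XoRL's documented API.
    """
    return {
        "method": "POST",
        "url": f"{BASE_URL}/train_step",
        "body": json.dumps({"batch_ids": batch_ids, "lr": learning_rate}),
    }

request = make_train_step_request(batch_ids=[0, 1, 2, 3], learning_rate=1e-5)
print(request["url"])
```

In this pattern the trainer stays passive: the orchestrator decides when to step, which rollouts to train on, and when to snapshot weights, which is what makes online RL loops composable with external inference engines.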
Parallelism strategies — mix and match freely:
| Strategy | Description |
|---|---|
| FSDP2 | Fully sharded data parallelism (PyTorch native) |
| Tensor Parallel | Column/row weight sharding across GPUs |
| Pipeline Parallel | Interleaved 1F1B schedule across stages |
| Context Parallel | Ring attention + Ulysses sequence parallel |
| Expert Parallel | MoE expert sharding via DeepEP |
Fine-tuning methods — full weights, LoRA, and QLoRA (int4/nvfp4/block_fp8), all FSDP2-compatible.
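As an illustration of how the strategies above compose, here is a hypothetical parallelism layout for a large MoE model on 64 GPUs. The key names below are assumptions for the sake of the example, not XoRL's actual config schema — see the config reference for the real options.

```yaml
# Hypothetical 64-GPU layout: 4 (FSDP) x 4 (TP) x 2 (PP) x 2 (EP).
# Key names are illustrative, not XoRL's documented schema.
parallelism:
  fsdp: 4              # shard parameters/grads/optimizer state 4 ways
  tensor_parallel: 4   # column/row-shard attention and MLP weights
  pipeline_parallel: 2 # interleaved 1F1B schedule across 2 stages
  expert_parallel: 2   # shard MoE experts across GPUs via DeepEP
```

The product of the degrees must equal the world size, so swapping one axis (say, trading pipeline depth for context parallelism on long-sequence workloads) only requires changing two numbers.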
## 🚀 Installation

```bash
git clone --recurse-submodules git@github.com:togethercomputer/xorl.git
cd xorl
uv sync
```

Already cloned without `--recurse-submodules`? Run:

```bash
git submodule update --init --recursive
```
See the installation guide for full setup including optional dependencies (DeepEP, Flash Attention).
## ⚡ Quick Start

```bash
# Local training on 8 GPUs
torchrun --nproc_per_node=8 -m xorl.cli.train examples/local/dummy/configs/full/qwen3_8b.yaml
```

See the quick start guide for more examples, including MoE, server training, and LoRA.
## 📚 Documentation

| Topic | Link |
|---|---|
| Parallelism | Overview |
| MoE & DeepEP | MoE docs |
| LoRA / QLoRA | Adapters |
| Server training | Server docs |
| Config reference | Local · Server |
## Supported models

| Model | Type | HuggingFace ID |
|---|---|---|
| Qwen3 | Dense | Qwen/Qwen3-8B, Qwen/Qwen3-32B, ... |
| Qwen3-MoE | Mixture-of-Experts | Qwen/Qwen3-30B-A3B, Qwen/Qwen3-235B-A22B, ... |
Models are loaded directly from HuggingFace checkpoints — no preprocessing needed. See the supported models page for details.
## Contributing

See CONTRIBUTING.md for development setup, coding conventions, and how to run tests.