
Mesh-Sharded Distributed Transolver-3


Scaling Transformer Solvers to Industrial-Scale Geometries (100M+ cells).

Based on the Transolver paper (ICML 2024 Spotlight) and the Transolver-3 paper.

Context

🌊 Traditional CFD solves Navier-Stokes on fine meshes using HPC clusters: a single DrivAerML car-aerodynamics run with 140M cells takes hours on hundreds of CPU cores.

🧠 Transolver replaces the iterative PDE solver with a transformer that learns the physics directly from data, predicting pressure, velocity, and other fields in a single forward pass.

🔬 Transolver-3 scales this to industrial-scale meshes (100M+ cells) through physics-aware attention in a compressed "slice domain" of only 64 slices.

🖥️ Mesh-sharded DDP distributes meshes too large for a single GPU across multiple GPUs: each GPU processes its local partition and all-reduces only the tiny slice accumulators (~514 KB/layer).

⚡ The result: 10-100× faster than classical solvers at engineering-grade accuracy.

DrivAerML pressure comparison

Key Innovations

  1. Faster Slice & Deslice: linear projections moved from the O(N) mesh domain to the O(M) slice domain via matrix-multiplication associativity
  2. Geometry Slice Tiling: input partitioned into tiles with gradient checkpointing, reducing peak memory from O(N*M) to O(N_t*M)
  3. Geometry Amortized Training: train on a random subset (100K-400K points) of the full mesh each iteration
  4. Physical State Caching: two-phase inference that builds a cache from chunks, then decodes any point
  5. Mixed Precision: full autocast + GradScaler support, halving the memory footprint
  6. Mesh-Sharded Distribution: shard meshes >100 GB across GPUs; all-reduce only the tiny slice accumulators (~514 KB/layer)
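
Innovation 1 rests on an identity worth making explicit: with soft slice weights S (N×M), point features X (N×C), and a projection W, associativity gives Sᵀ(XW) = (SᵀX)W, so the linear projection can run on M slice tokens instead of N mesh points. A minimal numpy sketch (shapes and names are illustrative, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C_in, C_out = 10_000, 64, 32, 32    # mesh points, slices, channels (illustrative)

X = rng.standard_normal((N, C_in))        # per-point features
W = rng.standard_normal((C_in, C_out))    # linear projection weights
S = rng.random((N, M))                    # soft slice-assignment weights
S /= S.sum(axis=1, keepdims=True)         # row-normalize, softmax-style

slow = S.T @ (X @ W)   # project all N points, then aggregate into M slices
fast = (S.T @ X) @ W   # aggregate into M slices first, then project only M tokens

assert np.allclose(slow, fast)            # identical result, O(M) projection cost
```

The projection cost drops from O(N·C_in·C_out) to O(M·C_in·C_out); with M = 64 and N in the millions, the projection becomes essentially free.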

Setup on Databricks

Notebook (first cell)

%pip install /Workspace/Repos/<user>/Transolver -q
dbutils.library.restartPython()

DAB deployment

databricks bundle deploy -t a10g       # 4x A10G (96 GB) β€” default
databricks bundle deploy -t a100_40    # 8x A100 40GB
databricks bundle deploy -t a100_80    # 8x A100 80GB

DAB Training Pipeline

The full pipeline runs 5 sequential tasks, each on its own cluster. MLflow is the single source of truth for model artifacts; no checkpoint files are passed between tasks.

databricks bundle deploy -t a10g
databricks bundle run transolver3_training_pipeline
| Task | Cluster | What it does |
| --- | --- | --- |
| preprocess | i3.xlarge (CPU) | Register mesh metadata + compute stats in Delta |
| train | g5.12xlarge (4x A10G) | Mesh-sharded DDP training via TorchDistributor, live MLflow metrics |
| evaluate | g5.12xlarge (4x A10G) | Load model from MLflow run, run cached inference on test set |
| register | i3.xlarge (CPU) | Promote already-logged model to UC Model Registry |
| deploy | i3.xlarge (CPU) | Create/update Model Serving endpoint with scale-to-zero |

The train task uses TorchDistributor(local_mode=True) to launch torchrun on a single multi-GPU node. Each GPU loads a disjoint 1/K shard of the mesh via mmap range reads. Gradients are all-reduced via NCCL. See SPECS/DISTRIBUTED_ARCHITECTURE.md for Mermaid diagrams of the full architecture.
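
The per-rank mmap range read can be sketched as follows; the binary file layout, helper name, and contiguous-range policy are assumptions for illustration, not the repo's actual loader:

```python
import os
import tempfile
import numpy as np

def load_shard(path, n_points, n_channels, rank, world_size, dtype=np.float32):
    """Memory-map the mesh file and materialize only this rank's 1/K point range."""
    per_rank = -(-n_points // world_size)            # ceiling division
    start = rank * per_rank
    stop = min(start + per_rank, n_points)
    mm = np.memmap(path, dtype=dtype, mode="r", shape=(n_points, n_channels))
    return np.array(mm[start:stop])                  # only these rows are read from disk

# Toy example: a 12-point, 3-channel mesh split across 4 "ranks"
path = os.path.join(tempfile.gettempdir(), "toy_mesh.bin")
np.arange(12 * 3, dtype=np.float32).reshape(12, 3).tofile(path)
shard = load_shard(path, n_points=12, n_channels=3, rank=1, world_size=4)
assert shard.shape == (3, 3) and shard[0, 0] == 9.0  # rank 1 owns points 3..5
```

Because the map is read-only and each rank slices a disjoint row range, no rank ever pages in another rank's portion of the mesh.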

Other DAB jobs

databricks bundle run gpu_memory_benchmark          # Single-GPU memory sweep
databricks bundle run distributed_sharded_test      # 2-GPU validation
databricks bundle run test_mlflow_auth              # Smoke test MLflow auth in child processes
databricks bundle run test_register_deploy          # Serverless register + deploy (fast iteration)

Claude Skills (Newcomer Guide)

Four Claude Code skills in skills/ provide step-by-step guidance for newcomers. All skills target Databricks notebooks and DABs; no local setup required.

| Skill | Purpose |
| --- | --- |
| transolver-data | Load, inspect, validate .npz meshes in UC Volumes; normalization; memory estimation |
| transolver-run | Config presets (small/medium/large), training in notebooks, 3-phase pipeline, TorchDistributor, DAB workflows |
| transolver-analyze | Loss interpretation, per-channel error stats, physical bounds checking, PSI drift detection, GPU profiling |
| transolver-deploy | MLflow tracking, UC model registration, serving endpoints, inference table monitoring, end-to-end checklist |

File Structure

transolver3/                          # Core package
├── physics_attention_v3.py           # Optimized Physics-Attention
├── transolver3_block.py              # Encoder block with tiled MLP
├── model.py                          # Transolver3 model
├── amortized_training.py             # Training (sampler, loss, scheduler, train_step)
├── inference.py                      # CachedInference + DistributedCachedInference
├── distributed.py                    # Multi-GPU mesh sharding utilities
├── normalizer.py                     # InputNormalizer, TargetNormalizer
├── profiling.py                      # Memory/latency benchmarking
├── serving.py                        # MLflow pyfunc wrapper for Model Serving
├── mlflow_utils.py                   # Experiment tracking + model logging
├── data_catalog.py                   # Delta Lake mesh metadata integration
├── databricks_training.py            # TorchDistributor launcher + Spark preprocessing
├── monitoring.py                     # Bounds checking + PSI drift detection
└── common.py                         # MLP, activations, timestep_embedding

resources/                            # DAB job definitions
├── training_workflow.yml             # 5-task pipeline (preprocess → train → evaluate → register → deploy)
├── serving_endpoint.yml              # Model Serving endpoint config
├── gpu_benchmark_job.yml             # Single-GPU memory benchmark
├── distributed_test_job.yml          # 2-GPU mesh-sharded test
├── test_mlflow_auth_job.yml          # Smoke test MLflow auth in TorchDistributor children
├── test_register_job.yml             # Serverless checkpoint inspection test
└── test_register_deploy_job.yml      # Serverless register + deploy test

scripts/                              # Entry points for DAB tasks
├── preprocess.py                     # Register mesh metadata + compute stats
├── register_model.py                 # Promote model from MLflow run to UC registry
├── deploy_endpoint.py                # Deploy to Databricks serving endpoint
├── test_mlflow_auth.py               # MLflow auth propagation smoke test
└── test_register.py                  # Checkpoint inspection + model load test

skills/                               # Claude Code skills for newcomers
├── transolver-data.md                # Mesh data management
├── transolver-run.md                 # Training & simulation
├── transolver-analyze.md             # Results analysis & drift
└── transolver-deploy.md              # Databricks deployment lifecycle

Industrial-Scale-Benchmarks/          # Experiments
├── exp_nasa_crm.py                   # NASA-CRM (~400K cells)
├── exp_ahmed_ml.py                   # AhmedML (~20M cells)
├── exp_drivaer_ml.py                 # DrivAerML (~160M cells, single GPU)
├── exp_drivaer_ml_distributed.py     # DrivAerML distributed (multi-GPU)
├── dataset/                          # Dataset loaders (with mesh sharding)
└── utils/metrics.py                  # Evaluation metrics

experiments/                          # v1 vs v3 comparison
├── compare_v1_v3_drivaer.py          # Synthetic data comparison
├── compare_v1_v3_real_drivaer.py     # Real DrivAerML VTP data comparison
├── COMPARE_v1v3.md                   # Results and analysis
└── results/                          # Pressure heatmap PNGs

benchmarks/                           # GPU benchmarking
├── gpu_memory_benchmark.py           # Sweep mesh sizes, measure all 3 phases
└── test_sharded_distributed.py       # Distributed sharding validation test

SPECS/                                # Design documentation
├── SPEC.md                           # Core v3 architecture specification
├── DISTRIBUTED.md                    # Multi-GPU distribution design
├── DISTRIBUTED_ARCHITECTURE.md       # Mermaid diagrams: pipeline, process model, data flow
├── CRITICAL_ISSUES.md                # Known issues & fixes
├── DIFFERENTIATORS.md                # Why Databricks is ideal
└── VALUEADDED.md                     # Databricks integration roadmap

tests/                                # 100 tests
├── test_transolver3.py               # Core model tests (41)
├── test_distributed.py               # Distributed sharding tests (11)
├── test_serving.py                   # Serving tests (4)
├── test_monitoring.py                # Monitoring tests (5)
├── test_data_catalog.py              # Catalog tests (6)
├── test_mlflow_utils.py              # MLflow tests (4)
└── test_databricks_training.py       # Training integration + auth propagation tests (12)

Memory Scaling

With tile_size=100K and fp16, the paper's claim of 2.9M cells on a single A100 80GB is achievable (~14 GB activations).
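
As a back-of-envelope check of that figure (the hidden width and live-tensor count below are illustrative assumptions, not values from the paper):

```python
# Rough fp16 activation estimate for the 2.9M-cell single-A100 scenario.
cells = 2_900_000      # mesh size from the claim above
hidden = 256           # assumed hidden width (illustrative)
bytes_fp16 = 2
live_tensors = 10      # assumed activation tensors kept live after checkpointing
gb = cells * hidden * bytes_fp16 * live_tensors / 1024**3
print(f"~{gb:.1f} GB of fp16 activations")  # ~13.8 GB, consistent with ~14 GB
```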

Profiling (notebook on GPU cluster)

from transolver3.profiling import benchmark_scaling, format_benchmark_table

results = benchmark_scaling(model, mesh_sizes=[1000, 10000, 100000],
    configs=[
        {'label': 'no_tiling', 'num_tiles': 0},
        {'label': 'tile_100k', 'tile_size': 100_000},
    ])
print(format_benchmark_table(results))

Multi-GPU Distribution

Transolver-3 already handles industrial-scale meshes on a single GPU via amortized subsampling and tiled attention. Multi-GPU distribution shards the mesh across GPUs to parallelize computation and reduce wall-clock time. Each GPU processes 1/K of the mesh independently; the slice accumulators s_raw (B,H,M,C) are additive and all-reduced (~514 KB/layer).
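
Because s_raw is a sum over mesh points, per-shard accumulators add up exactly to the full-mesh accumulator; that additivity is what makes the all-reduce so cheap. A numpy sketch (in practice the sum is a torch.distributed.all_reduce over NCCL; sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B, H, M, C, N, K = 1, 4, 64, 32, 8_000, 4    # batch, heads, slices, channels, points, GPUs

w = rng.random((B, H, N, M))                  # soft slice weights per mesh point
x = rng.standard_normal((B, H, N, C))         # per-point values
full = np.einsum('bhnm,bhnc->bhmc', w, x)     # full-mesh slice accumulator s_raw (B,H,M,C)

# Each rank accumulates over a disjoint 1/K point shard (strided here for brevity);
# summing the per-rank results is exactly what all-reduce computes.
shards = [np.einsum('bhnm,bhnc->bhmc', w[:, :, i::K], x[:, :, i::K]) for i in range(K)]
assert np.allclose(sum(shards), full)
assert full.shape == (B, H, M, C)             # only this small tensor crosses GPUs
```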

Validated on 4x NVIDIA A10G: sharded cache and decode produce zero numerical difference vs single-GPU. See SPECS/DISTRIBUTED.md for the original design.

GPU Benchmark (DAB)

Three DAB targets map to different GPU instances:

| Target | Instance | GPU | VRAM | Use case |
| --- | --- | --- | --- | --- |
| a10g (default) | g5.12xlarge | 4x NVIDIA A10G | 96 GB | Multi-GPU training, benchmarks |
| a100_40 | p4d.24xlarge | 8x NVIDIA A100 | 320 GB | Large-scale training |
| a100_80 | p4de.24xlarge | 8x NVIDIA A100 | 640 GB | Full-scale DrivAerML |
databricks bundle deploy -t a10g
databricks bundle run gpu_memory_benchmark          # Single-GPU memory sweep
databricks bundle run distributed_sharded_test      # 2-GPU validation
databricks bundle run training_workflow             # Full 5-task pipeline

The benchmark sweeps mesh sizes and measures peak GPU memory across all 3 pipeline phases (training, cache build, decode) using synthetic DrivAer ML data.

v1 vs v3 Comparison

Includes experiments comparing Transolver v1 and v3 on both synthetic and real DrivAerML data. See experiments/COMPARE_v1v3.md for full results and pressure heatmaps on the real DrivAer vehicle.

DrivAerML pressure comparison

Citation

@inproceedings{wu2024Transolver,
  title={Transolver: A Fast Transformer Solver for PDEs on General Geometries},
  author={Haixu Wu and Huakun Luo and Haowen Wang and Jianmin Wang and Mingsheng Long},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@article{wu2026Transolver3,
  title={Transolver++: Industrial-Scale Simulation with Transformer Solvers},
  author={Haixu Wu and Huakun Luo and Haowen Wang and Jianmin Wang and Mingsheng Long},
  journal={arXiv preprint arXiv:2602.04940},
  year={2026}
}

About

A reimplementation of the Transolver-3 paper (arXiv 2602.04940) with distributed training and inference on Databricks.
