Scaling Transformer Solvers to Industrial-Scale Geometries (100M+ cells).
Based on the Transolver paper (ICML 2024 Spotlight) and the Transolver-3 paper.
- Traditional CFD solves Navier-Stokes on fine meshes using HPC clusters: a single DrivAerML car aerodynamics run with 140M cells takes hours on hundreds of CPU cores.
- Transolver replaces the iterative PDE solver with a transformer that learns the physics directly from data, predicting pressure, velocity, and other fields in a single forward pass.
- Transolver-3 scales this to industrial-scale meshes (100M+ cells) through physics-aware attention in a compressed "slice domain" of only 64 slices.
- Mesh-sharded DDP distributes meshes too large for a single GPU across multiple GPUs: each GPU processes its local partition and all-reduces only the tiny slice accumulators (~514 KB/layer).
- The result: 10-100× faster than classical solvers at engineering-grade accuracy.
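To make the 64-slice compression concrete, here is a minimal single-head sketch of the Physics-Attention idea (illustrative shapes and names only; the actual `physics_attention_v3.py` implementation differs): the quadratic attention cost depends on the M=64 slices, not on the N mesh points.

```python
import torch

def physics_attention(x: torch.Tensor, slice_proj: torch.nn.Linear) -> torch.Tensor:
    """Sketch: compress N mesh points into M learnable slices, attend in the
    tiny slice domain, then broadcast the result back to every point."""
    # (N, M): soft assignment of each mesh point to each slice
    w = torch.softmax(slice_proj(x), dim=-1)
    # Slice tokens: weighted average of point features per slice -> (M, C)
    z = (w.T @ x) / (w.sum(dim=0, keepdim=True).T + 1e-8)
    # Self-attention among M slice tokens: O(M^2), independent of mesh size N
    attn = torch.softmax(z @ z.T / z.shape[-1] ** 0.5, dim=-1)
    z = attn @ z
    # Deslice: each point gathers its weighted mix of updated slice tokens -> (N, C)
    return w @ z

x = torch.randn(1000, 32)        # 1000 mesh points, 32 channels
proj = torch.nn.Linear(32, 64)   # 64 slices, as in Transolver-3
out = physics_attention(x, proj)
print(out.shape)                 # torch.Size([1000, 32])
```

Because attention happens among 64 tokens regardless of N, the same block scales from 400K-cell NASA-CRM meshes to 160M-cell DrivAerML meshes.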
- Faster Slice & Deslice: linear projections moved from the O(N) mesh domain to the O(M) slice domain via matrix-multiplication associativity
- Geometry Slice Tiling: input partitioned into tiles with gradient checkpointing, reducing peak memory from O(NM) to O(N_t*M)
- Geometry Amortized Training: train on random subsets (100K-400K points) of the full mesh each iteration
- Physical State Caching: two-phase inference that builds a cache from chunks, then decodes any point
- Mixed Precision: full autocast + GradScaler support, halving the memory footprint
- Mesh-Sharded Distribution: shard meshes >100 GB across GPUs; all-reduce only the tiny slice accumulators (~514 KB/layer)
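The "Faster Slice & Deslice" item follows directly from matmul associativity: projecting after slicing yields the same tokens as projecting every mesh point first, but touches only M=64 tokens instead of N points. A minimal sketch (float64 so the two orderings compare exactly; names are illustrative):

```python
import torch

N, M, C, D = 10_000, 64, 128, 128
x = torch.randn(N, C, dtype=torch.float64)                         # point features
w = torch.softmax(torch.randn(N, M, dtype=torch.float64), dim=-1)  # slice weights
proj = torch.nn.Linear(C, D, bias=False, dtype=torch.float64)

z_mesh = w.T @ proj(x)    # naive: project all N points, then slice  -> O(N*C*D)
z_slice = proj(w.T @ x)   # fast: slice raw features, project M tokens -> O(M*C*D)

same = torch.allclose(z_mesh, z_slice)
print(same)  # True: (w^T x) W = w^T (x W)
```

The saving compounds because the same reordering applies to the deslice projection on the way back out.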
```python
%pip install /Workspace/Repos/<user>/Transolver -q
dbutils.library.restartPython()
```

```shell
databricks bundle deploy -t a10g      # 4x A10G (96 GB), default
databricks bundle deploy -t a100_40   # 8x A100 40GB
databricks bundle deploy -t a100_80   # 8x A100 80GB
```

The full pipeline runs 5 sequential tasks, each on its own cluster. MLflow is the single source of truth for model artifacts: no checkpoint files are passed between tasks.

```shell
databricks bundle deploy -t a10g
databricks bundle run transolver3_training_pipeline
```

| Task | Cluster | What it does |
|---|---|---|
| preprocess | i3.xlarge (CPU) | Register mesh metadata + compute stats in Delta |
| train | g5.12xlarge (4x A10G) | Mesh-sharded DDP training via TorchDistributor, live MLflow metrics |
| evaluate | g5.12xlarge (4x A10G) | Load model from MLflow run, run cached inference on test set |
| register | i3.xlarge (CPU) | Promote already-logged model to UC Model Registry |
| deploy | i3.xlarge (CPU) | Create/update Model Serving endpoint with scale-to-zero |
The train task uses TorchDistributor(local_mode=True) to launch torchrun on a single multi-GPU node. Each GPU loads a disjoint 1/K shard of the mesh via mmap range reads. Gradients are all-reduced via NCCL. See SPECS/DISTRIBUTED_ARCHITECTURE.md for Mermaid diagrams of the full architecture.
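The per-rank shard load can be sketched as follows (a hypothetical helper, not the package's `distributed.py` API; shown with a `.npy` file, since compressed `.npz` members cannot be memory-mapped directly and would first be unpacked):

```python
import os
import tempfile
import numpy as np

def load_shard(npy_path: str, rank: int, world_size: int) -> np.ndarray:
    """Memory-map the full array but materialize only this rank's 1/K row
    range, so no single process ever holds the whole 100M-cell mesh."""
    arr = np.load(npy_path, mmap_mode="r")   # header-only read; data stays on disk
    n = arr.shape[0]
    start = rank * n // world_size
    stop = (rank + 1) * n // world_size
    return np.array(arr[start:stop])         # copy just this rank's rows into RAM

# Demo with a tiny synthetic mesh of 6 points
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "points.npy")
    np.save(path, np.arange(12, dtype=np.float32).reshape(6, 2))
    shard0 = load_shard(path, rank=0, world_size=2)
    shard1 = load_shard(path, rank=1, world_size=2)
print(shard0.shape, shard1.shape)  # (3, 2) (3, 2)
```

Each rank would call this with its own `rank` inside the training function handed to `TorchDistributor(local_mode=True).run(...)`; the shards are disjoint and together cover the full mesh.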
```shell
databricks bundle run gpu_memory_benchmark     # Single-GPU memory sweep
databricks bundle run distributed_sharded_test # 2-GPU validation
databricks bundle run test_mlflow_auth         # Smoke test MLflow auth in child processes
databricks bundle run test_register_deploy     # Serverless register + deploy (fast iteration)
```

Four Claude Code skills in skills/ provide step-by-step guidance for newcomers. All skills target Databricks notebooks and DABs: no local setup required.
| Skill | Purpose |
|---|---|
| transolver-data | Load, inspect, validate .npz meshes in UC Volumes; normalization; memory estimation |
| transolver-run | Config presets (small/medium/large), training in notebooks, 3-phase pipeline, TorchDistributor, DAB workflows |
| transolver-analyze | Loss interpretation, per-channel error stats, physical bounds checking, PSI drift detection, GPU profiling |
| transolver-deploy | MLflow tracking, UC model registration, serving endpoints, inference table monitoring, end-to-end checklist |
```
transolver3/                       # Core package
├── physics_attention_v3.py        # Optimized Physics-Attention
├── transolver3_block.py           # Encoder block with tiled MLP
├── model.py                       # Transolver3 model
├── amortized_training.py          # Training (sampler, loss, scheduler, train_step)
├── inference.py                   # CachedInference + DistributedCachedInference
├── distributed.py                 # Multi-GPU mesh sharding utilities
├── normalizer.py                  # InputNormalizer, TargetNormalizer
├── profiling.py                   # Memory/latency benchmarking
├── serving.py                     # MLflow pyfunc wrapper for Model Serving
├── mlflow_utils.py                # Experiment tracking + model logging
├── data_catalog.py                # Delta Lake mesh metadata integration
├── databricks_training.py         # TorchDistributor launcher + Spark preprocessing
├── monitoring.py                  # Bounds checking + PSI drift detection
└── common.py                      # MLP, activations, timestep_embedding
resources/                         # DAB job definitions
├── training_workflow.yml          # 5-task pipeline (preprocess → train → evaluate → register → deploy)
├── serving_endpoint.yml           # Model Serving endpoint config
├── gpu_benchmark_job.yml          # Single-GPU memory benchmark
├── distributed_test_job.yml       # 2-GPU mesh-sharded test
├── test_mlflow_auth_job.yml       # Smoke test MLflow auth in TorchDistributor children
├── test_register_job.yml          # Serverless checkpoint inspection test
└── test_register_deploy_job.yml   # Serverless register + deploy test
scripts/                           # Entry points for DAB tasks
├── preprocess.py                  # Register mesh metadata + compute stats
├── register_model.py              # Promote model from MLflow run to UC registry
├── deploy_endpoint.py             # Deploy to Databricks serving endpoint
├── test_mlflow_auth.py            # MLflow auth propagation smoke test
└── test_register.py               # Checkpoint inspection + model load test
skills/                            # Claude Code skills for newcomers
├── transolver-data.md             # Mesh data management
├── transolver-run.md              # Training & simulation
├── transolver-analyze.md          # Results analysis & drift
└── transolver-deploy.md           # Databricks deployment lifecycle
Industrial-Scale-Benchmarks/       # Experiments
├── exp_nasa_crm.py                # NASA-CRM (~400K cells)
├── exp_ahmed_ml.py                # AhmedML (~20M cells)
├── exp_drivaer_ml.py              # DrivAerML (~160M cells, single GPU)
├── exp_drivaer_ml_distributed.py  # DrivAerML distributed (multi-GPU)
├── dataset/                       # Dataset loaders (with mesh sharding)
└── utils/metrics.py               # Evaluation metrics
experiments/                       # v1 vs v3 comparison
├── compare_v1_v3_drivaer.py       # Synthetic data comparison
├── compare_v1_v3_real_drivaer.py  # Real DrivAerML VTP data comparison
├── COMPARE_v1v3.md                # Results and analysis
└── results/                       # Pressure heatmap PNGs
benchmarks/                        # GPU benchmarking
├── gpu_memory_benchmark.py        # Sweep mesh sizes, measure all 3 phases
└── test_sharded_distributed.py    # Distributed sharding validation test
SPECS/                             # Design documentation
├── SPEC.md                        # Core v3 architecture specification
├── DISTRIBUTED.md                 # Multi-GPU distribution design
├── DISTRIBUTED_ARCHITECTURE.md    # Mermaid diagrams: pipeline, process model, data flow
├── CRITICAL_ISSUES.md             # Known issues & fixes
├── DIFFERENTIATORS.md             # Why Databricks is ideal
└── VALUEADDED.md                  # Databricks integration roadmap
tests/                             # 100 tests
├── test_transolver3.py            # Core model tests (41)
├── test_distributed.py            # Distributed sharding tests (11)
├── test_serving.py                # Serving tests (4)
├── test_monitoring.py             # Monitoring tests (5)
├── test_data_catalog.py           # Catalog tests (6)
├── test_mlflow_utils.py           # MLflow tests (4)
└── test_databricks_training.py    # Training integration + auth propagation tests (12)
```
With tile_size=100K and fp16, the paper's claim of 2.9M cells on a single A100 80GB is achievable (~14 GB activations).
```python
from transolver3.profiling import benchmark_scaling, format_benchmark_table

results = benchmark_scaling(
    model,
    mesh_sizes=[1000, 10000, 100000],
    configs=[
        {'label': 'no_tiling', 'num_tiles': 0},
        {'label': 'tile_100k', 'tile_size': 100_000},
    ],
)
print(format_benchmark_table(results))
```

Transolver-3 already handles industrial-scale meshes on a single GPU via amortized subsampling and tiled attention. Multi-GPU distribution shards the mesh across GPUs to parallelize computation and reduce wall-clock time. Each GPU processes 1/K of the mesh independently; the slice accumulators s_raw (B, H, M, C) are additive and all-reduced (~514 KB/layer).
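The ~514 KB figure falls out of the accumulator shapes. With one illustrative dimension choice (B=1, H=8 heads, M=64 slices, C=256 channels; an assumption for the sketch, not necessarily the shipped config), the fp32 feature sums plus a per-slice weight-mass tensor come to exactly 514 KB, and because both are plain sums over local mesh points, a single SUM all-reduce reconstructs the full-mesh state:

```python
import torch
import torch.distributed as dist

def sync_slice_state(s_raw: torch.Tensor, counts: torch.Tensor):
    """Slice accumulators are additive over mesh points, so summing the
    per-rank partials across GPUs yields exactly the full-mesh result."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(s_raw, op=dist.ReduceOp.SUM)
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return s_raw, counts

# Per-layer all-reduce traffic for B=1, H=8, M=64, C=256 in fp32:
s_raw = torch.zeros(1, 8, 64, 256)   # slice feature sums (B, H, M, C)
counts = torch.zeros(1, 8, 64, 1)    # per-slice weight mass (B, H, M, 1)
kb = (s_raw.numel() + counts.numel()) * 4 / 1024
print(f"{kb:.0f} KB")  # 514 KB: 512 KB of features + 2 KB of counts
```

Shipping half a megabyte per layer instead of the multi-gigabyte point features is what makes NCCL communication negligible at 100M+ cells.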
Validated on 4x NVIDIA A10G: sharded cache and decode produce zero numerical difference vs single-GPU. See SPECS/DISTRIBUTED.md for the original design.
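The two-phase pattern behind the cached inference path can be sketched with a toy class (hypothetical names, not the package's `CachedInference` API): phase 1 streams mesh chunks into additive slice accumulators; phase 2 decodes arbitrary query points against the frozen cache.

```python
import torch

class SliceCache:
    """Toy two-phase inference: build a 64-slice state from chunks, then
    decode any subset of points without revisiting the full mesh."""
    def __init__(self, num_slices: int = 64, channels: int = 32):
        self.z = torch.zeros(num_slices, channels)  # slice feature sums
        self.w_sum = torch.zeros(num_slices, 1)     # accumulated slice weight mass

    def add_chunk(self, x: torch.Tensor, w: torch.Tensor):
        """Phase 1: fold a chunk (x: (n, C) features, w: (n, M) weights) into the cache."""
        self.z += w.T @ x
        self.w_sum += w.sum(0, keepdim=True).T

    def decode(self, w_query: torch.Tensor) -> torch.Tensor:
        """Phase 2: read out features for query points given their (q, M) weights."""
        tokens = self.z / (self.w_sum + 1e-8)       # finalized slice tokens
        return w_query @ tokens                     # (q, C)

cache = SliceCache()
for _ in range(4):                                  # stream the mesh in 4 chunks
    x = torch.randn(250, 32)
    w = torch.softmax(torch.randn(250, 64), dim=-1)
    cache.add_chunk(x, w)
out = cache.decode(torch.softmax(torch.randn(10, 64), dim=-1))
print(out.shape)  # torch.Size([10, 32])
```

Because the accumulators are sums, the same cache can be built shard-by-shard across GPUs and merged with the all-reduce described above, which is why sharded and single-GPU decode agree exactly.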
Three DAB targets map to different GPU instances:
| Target | Instance | GPU | VRAM | Use case |
|---|---|---|---|---|
| a10g (default) | g5.12xlarge | 4x NVIDIA A10G | 96 GB | Multi-GPU training, benchmarks |
| a100_40 | p4d.24xlarge | 8x NVIDIA A100 | 320 GB | Large-scale training |
| a100_80 | p4de.24xlarge | 8x NVIDIA A100 | 640 GB | Full-scale DrivAerML |
```shell
databricks bundle deploy -t a10g
databricks bundle run gpu_memory_benchmark     # Single-GPU memory sweep
databricks bundle run distributed_sharded_test # 2-GPU validation
databricks bundle run training_workflow        # Full 5-task pipeline
```

The benchmark sweeps mesh sizes and measures peak GPU memory across all three pipeline phases (training, cache build, decode) using synthetic DrivAerML data.
The repository also includes experiments comparing Transolver v1 and v3 on both synthetic and real DrivAerML data. See experiments/COMPARE_v1v3.md for full results and pressure heatmaps on the real DrivAer vehicle.
```bibtex
@inproceedings{wu2024Transolver,
  title={Transolver: A Fast Transformer Solver for PDEs on General Geometries},
  author={Haixu Wu and Huakun Luo and Haowen Wang and Jianmin Wang and Mingsheng Long},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@article{wu2026Transolver3,
  title={Transolver++: Industrial-Scale Simulation with Transformer Solvers},
  author={Haixu Wu and Huakun Luo and Haowen Wang and Jianmin Wang and Mingsheng Long},
  journal={arXiv preprint arXiv:2602.04940},
  year={2026}
}
```

