The guide below explains the steps required to extend dstack with support for a new cloud provider.
The gpuhunt project is a utility that dstack uses to collect information
about cloud providers, their supported machine configurations, pricing, etc. This information is later used by dstack
for provisioning machines.
Thus, in order to support a new cloud provider with dstack, you first need to add the cloud provider to gpuhunt.
To add a new cloud provider to gpuhunt, follow these steps:
Clone the repo:

```shell
git clone https://github.com/dstackai/gpuhunt.git
```

Decide whether your provider is offline or online:

- Offline providers offer static machine configurations that are not frequently updated.
  `gpuhunt` collects offline providers' instance offers on an hourly basis. Examples: `aws`, `gcp`, `azure`, etc.
- Online providers offer dynamic machine configurations that are available at the very moment
  when you fetch configurations (e.g., GPU marketplaces).
  `gpuhunt` collects online providers' instance offers each time a `dstack` user provisions a new instance. Examples: `tensordock`, `vastai`, etc.
Create the provider class file under `src/gpuhunt/providers`.
Make sure your class extends the `AbstractProvider`
base class. See its docstrings for descriptions of the methods that your class should implement.
Refer to examples:

- Offline providers: `verda.py`, `aws.py`, `azure.py`, `lambdalabs.py`.
- Online providers: `vultr.py`, `tensordock.py`, `vastai.py`.
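As a rough sketch of the shape such a class takes (this is not the actual `gpuhunt` API — the catalog item model and method signature below are simplified stand-ins; consult the `AbstractProvider` docstrings and the example providers for the real interface):

```python
from dataclasses import dataclass
from typing import List, Optional


# Simplified stand-in for gpuhunt's catalog item model; the real model
# lives in gpuhunt and carries more fields (spot, disk size, etc.).
@dataclass
class RawCatalogItem:
    instance_name: str
    location: str
    price: float
    cpu: int
    memory: float
    gpu_count: int
    gpu_name: Optional[str]


class ExampleCloudProvider:
    """Hypothetical provider; a real one must extend gpuhunt's
    AbstractProvider and implement its documented methods."""

    NAME = "examplecloud"  # hypothetical provider name

    def get(self, query_filter=None, balance_resources: bool = True) -> List[RawCatalogItem]:
        # A real provider would call the cloud's API here and map each
        # machine configuration to a catalog item.
        offers = [
            {"name": "gpu-small", "region": "us-east", "price": 1.10,
             "cpu": 8, "memory": 32.0, "gpus": 1, "gpu_name": "A10"},
        ]
        return [
            RawCatalogItem(
                instance_name=o["name"],
                location=o["region"],
                price=o["price"],
                cpu=o["cpu"],
                memory=o["memory"],
                gpu_count=o["gpus"],
                gpu_name=o["gpu_name"],
            )
            for o in offers
        ]
```

The essential job of the class is the mapping step: fetch whatever shape the provider's API returns and normalize it into `gpuhunt`'s catalog items.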
Add your provider in the following places:

- Either `OFFLINE_PROVIDERS` or `ONLINE_PROVIDERS` in `src/gpuhunt/_internal/catalog.py`.
- The `python -m gpuhunt` command in `src/gpuhunt/__main__.py`.
- (offline providers) The CI workflow in `.github/workflows/catalogs.yml`.
- (online providers) The default catalog in `src/gpuhunt/_internal/default.py`.
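Schematically, the `catalog.py` registration amounts to adding the provider's name to one of the two lists (the list contents below are illustrative; check the actual lists in `src/gpuhunt/_internal/catalog.py`):

```python
# Illustrative only: mirrors the structure of
# src/gpuhunt/_internal/catalog.py, where provider names are grouped
# by collection mode.
OFFLINE_PROVIDERS = ["aws", "azure", "gcp", "verda"]
ONLINE_PROVIDERS = ["tensordock", "vastai", "vultr"]

# An offline provider is registered on the hourly-collected list:
OFFLINE_PROVIDERS.append("examplecloud")  # hypothetical provider name
```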
For offline providers, you can add data quality tests under `src/integrity_tests/`.
Data quality tests are run after collecting offline catalogs to ensure their integrity.
Refer to examples: `test_verda.py`, `test_gcp.py`.
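A data quality test is typically a small pytest module that loads the collected catalog and asserts invariants on it. A hedged sketch (the inline CSV and loader below are stand-ins for the real catalog fixtures; see `test_gcp.py` for how existing tests load collected data):

```python
import csv
import io

# Stand-in for a collected offline catalog; real integrity tests read
# the CSV produced by the hourly collection job.
SAMPLE_CATALOG = """instance_name,location,price,gpu_count
gpu-small,us-east,1.10,1
cpu-large,us-east,0.40,0
"""


def load_catalog(text: str):
    """Parse catalog CSV rows into dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def test_prices_are_positive():
    # Every offer must have a sane, positive price.
    for row in load_catalog(SAMPLE_CATALOG):
        assert float(row["price"]) > 0


def test_catalog_contains_gpu_offers():
    # A GPU provider's catalog should contain at least one GPU offer.
    rows = load_catalog(SAMPLE_CATALOG)
    assert any(int(row["gpu_count"]) > 0 for row in rows)
```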
Once the cloud provider is added, submit a pull request.
Anything unclear? Ask questions on the Discord server.
Once the provider is added to `gpuhunt`, we can proceed with implementing
the corresponding backend in `dstack`. Follow the steps below.
See the Appendix at the end of this document and make sure the provider meets the outlined requirements.
Follow DEVELOPMENT.md.
Add any dependencies required by your cloud provider to `setup.py`. Create a separate section with the provider's name for these dependencies, and ensure that you update the `all` section to include them as well.
Add a new enumeration member for your provider to `BackendType` (`src/dstack/_internal/core/models/backends/base.py`).
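Schematically, the change looks like this (the enum below is abbreviated for illustration; the real `BackendType` in `src/dstack/_internal/core/models/backends/base.py` has one member per supported backend):

```python
import enum


class BackendType(str, enum.Enum):
    # A few existing members, abbreviated for illustration:
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    # The new member for a hypothetical ExampleXYZ provider:
    EXAMPLEXYZ = "examplexyz"
```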
dstack provides a helper script to generate all the necessary files and classes for a new backend.
To add a new backend named ExampleXYZ, you should run:

```shell
python scripts/add_backend.py -n ExampleXYZ
```

It will create an `examplexyz` backend directory under `src/dstack/_internal/core/backends` with the following files:

- `backend.py` with the `Backend` class implementation. You typically don't need to modify it.
- `compute.py` with the `Compute` class implementation. This is the core of the backend that you need to implement.
- `configurator.py` with the `Configurator` class implementation. It deals with validating and storing backend config. You need to adjust it with custom backend config validation.
- `models.py` with all the backend config models used by `Backend`, `Compute`, `Configurator`, and other parts of `dstack`.
Go to `models.py`. It'll contain two config models required for all backends:

- `*BackendConfig` that contains all backend parameters available for user configuration except for creds.
- `*BackendConfigWithCreds` that contains all backend parameters available for user configuration and also creds.
Adjust the generated config models by adding additional config parameters.
Typically, you only need to modify the `*BackendConfig` model since the other models extend it.
Then add these models to the `AnyBackendConfig*` unions in `src/dstack/_internal/core/backends/models.py`.
The script also generates `*BackendStoredConfig`, which extends `*BackendConfig` to be able to store extra parameters in the DB. By the same logic, it generates `*Config`, which extends `*BackendStoredConfig` with creds and is used as the main `Backend` and `Compute` config instead of `*BackendConfigWithCreds` directly.
Refer to examples: `verda`, `aws`, `gcp`, `azure`, etc.
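The layering of these models can be sketched as follows. Note this is an illustration of the inheritance structure only: dstack's real config models are pydantic-based and are generated for you in `models.py`, and the `regions` and `project_id` parameters below are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional


# Illustration of the model layering only; the real models are
# pydantic-based and generated by scripts/add_backend.py.
@dataclass
class ExampleXYZBackendConfig:
    # User-facing parameters, without creds:
    regions: Optional[List[str]] = None
    project_id: Optional[str] = None  # hypothetical provider parameter


@dataclass
class ExampleXYZBackendConfigWithCreds(ExampleXYZBackendConfig):
    # The same parameters plus credentials:
    creds: Optional[dict] = None
```

New user-configurable parameters go into the base `*BackendConfig`; the creds-bearing model inherits them automatically.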
Go to `compute.py` and implement the `Compute` methods.
Optionally, extend and implement `ComputeWith*` classes to support additional features such as fleets, volumes, gateways, placement groups, etc. For example, extend `ComputeWithCreateInstanceSupport` to support fleets.
Refer to examples: `verda`, `aws`, `gcp`, `azure`, etc.
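In outline, `Compute` translates dstack's provisioning calls into calls to the provider's API. A simplified sketch (the method names approximate the real interface, and the return model is a stand-in; a real implementation subclasses dstack's `Compute` base class and the relevant `ComputeWith*` mixins, following their docstrings):

```python
from dataclasses import dataclass


@dataclass
class LaunchedInstance:
    # Simplified stand-in for the provisioning data dstack tracks;
    # real backends return a richer model.
    instance_id: str
    ip_address: str


class ExampleXYZCompute:
    """Hypothetical sketch; a real Compute subclasses dstack's Compute
    base class (plus mixins such as ComputeWithCreateInstanceSupport)
    and implements their documented methods."""

    def __init__(self, api_client):
        self._client = api_client  # hypothetical provider API client

    def create_instance(self, instance_type: str, region: str) -> LaunchedInstance:
        # Launch a machine via the provider's API and record what
        # dstack needs to reach it over SSH.
        resp = self._client.launch(instance_type=instance_type, region=region)
        return LaunchedInstance(instance_id=resp["id"], ip_address=resp["ip"])

    def terminate_instance(self, instance_id: str) -> None:
        # Tear the machine down when the job or fleet is done.
        self._client.terminate(instance_id)
```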
Go to `configurator.py` and implement custom `Configurator` logic. At minimum, you should implement creds validation.
You may also need to validate other config parameters if there are any.
Refer to examples: `verda`, `aws`, `gcp`, `azure`, etc.
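Creds validation usually means making one cheap authenticated API call and surfacing failure as a user-facing config error. A hedged sketch (the error class, client, and `list_regions` call are stand-ins, not dstack's real names):

```python
class BackendInvalidCredentialsError(Exception):
    """Stand-in for dstack's config validation error."""


def validate_creds(client) -> None:
    # A real configurator makes a cheap authenticated call (e.g.,
    # listing regions) and translates auth failures into a config
    # error the user sees when applying the backend config.
    try:
        client.list_regions()
    except PermissionError as e:
        raise BackendInvalidCredentialsError("Invalid credentials") from e
```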
Register the configurator by appending it to `_CONFIGURATOR_CLASSES` in `src/dstack/_internal/core/backends/configurators.py`.
If instances in the backend take more than 10 minutes to start, override the default provisioning timeout in
`src/dstack/_internal/server/background/tasks/common.py`.
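Conceptually, the timeout override is a per-backend lookup that falls back to the 10-minute default (the names below are illustrative, not the actual contents of `common.py`):

```python
from datetime import timedelta

# Illustrative only: a per-backend provisioning timeout with a
# 10-minute default, mirroring the idea behind the override in
# src/dstack/_internal/server/background/tasks/common.py.
DEFAULT_PROVISIONING_TIMEOUT = timedelta(minutes=10)
PROVISIONING_TIMEOUTS = {
    "examplexyz": timedelta(minutes=20),  # hypothetical slow-starting backend
}


def get_provisioning_timeout(backend: str) -> timedelta:
    return PROVISIONING_TIMEOUTS.get(backend, DEFAULT_PROVISIONING_TIMEOUT)
```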
Add the backend to the Concepts->Backends page and the `server/config.yml` reference.
dstack supports two types of backend compute:
- VM-based
- Container-based
Used if the cloud provider allows provisioning virtual machines (VMs).
When `dstack` provisions a VM, it launches the `dstack-shim` agent inside the VM.
The agent controls the VM and starts Docker containers for users' jobs.
Since `dstack` controls the entire VM, VM-based backends can support more features,
such as blocks, instance volumes, privileged containers, and reusable instances.
Note that all VM-based backend `Compute` classes should subclass the `ComputeWithPrivilegedSupport` mixin,
since the `dstack-shim` agent provides this functionality out of the box.
To support a VM-based backend, dstack expects the following:
- An API for creating and terminating VMs
- An external IP and a public port for SSH
- Cloud-init (preferred)
- VM images with Ubuntu, OpenSSH, GPU drivers, and Docker with NVIDIA runtime
For some VM-based backends, the dstack team also maintains
custom VM images with the required dependencies
and dstack-specific optimizations.
Examples of VM-based backends include: `aws`, `azure`, `gcp`, `lambda`, `verda`, etc.
Used if the cloud provider only allows provisioning containers.
When `dstack` provisions a container, it launches the `dstack-runner` agent inside the container.
The agent accepts and runs users' jobs.
Since dstack doesn't control the underlying machine, container-based backends don't support some
dstack features, such as blocks, instance volumes, privileged containers, and reusable instances.
To support a container-based backend, dstack expects the following:
- An API for creating and terminating containers
- Containers properly configured to access GPUs
- An external IP and a public port for SSH
- A way to specify the Docker image
- A way to specify credentials for pulling images from private Docker registries
- A way to override the container entrypoint (supporting entrypoints of at least ~2KB in length)
- A way to override the container user to root (as in `docker run --user root ...`)
Examples of container-based backends include: `kubernetes`, `vastai`, `runpod`.