
add elastic scheduling and make the system more stable #969

Merged
helloyongyang merged 1 commit into ModelTC:main from zhtshr:zht_dev
Mar 31, 2026

Conversation

Contributor

@zhtshr commented Mar 31, 2026

This pull request introduces several significant enhancements and new features across the disaggregated service framework, primarily focusing on improved RDMA atomic operation support, dynamic instance management in the controller, and monitoring extensibility. The most impactful changes include full support for RDMA atomic verbs (fetch-and-add and compare-and-swap), controller logic for dynamic GPU/instance lifecycle management, and new hooks for reporting custom metrics. Additionally, there are improvements to configuration flexibility and command-line overrides.

RDMA atomic operations and server/client enhancements:

  • Added true remote atomic fetch-and-add (rdma_faa) and compare-and-swap (rdma_cas) operations to RDMAClient, including support for the corresponding RDMA opcodes and access flags. Both client and server now register memory regions with REMOTE_ATOMIC access, and QP attributes are updated accordingly. This enables real atomic operations over RDMA, replacing previous best-effort shims.

  • Updated example/test code to demonstrate usage of the new atomic RDMA operations, including writing, reading, fetch-and-add, and compare-and-swap.
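For readers unfamiliar with the two verbs, their semantics can be sketched with a small in-memory model. This is an illustration only — `AtomicRegion` and its method names are hypothetical stand-ins, not the RDMAClient API:

```python
import threading

class AtomicRegion:
    """In-memory stand-in for a remotely registered 64-bit word, modelling
    the semantics of the RDMA atomic verbs (IBV_WR_ATOMIC_FETCH_AND_ADD and
    IBV_WR_ATOMIC_CMP_AND_SWP). Illustrative only."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()  # stands in for the NIC's atomicity guarantee

    def faa(self, add: int) -> int:
        """Fetch-and-add: atomically store old + add, return the old value."""
        with self._lock:
            old = self._value
            self._value = (old + add) & 0xFFFFFFFFFFFFFFFF  # 64-bit wraparound
            return old

    def cas(self, compare: int, swap: int) -> int:
        """Compare-and-swap: store swap only if the current value equals
        compare; always return the value observed before the operation."""
        with self._lock:
            old = self._value
            if old == compare:
                self._value = swap
            return old

region = AtomicRegion(10)
assert region.faa(5) == 10       # returns the old value...
assert region.faa(0) == 15       # ...and the add was applied
assert region.cas(15, 42) == 15  # matched: value is now 42
assert region.cas(15, 99) == 42  # mismatch: value unchanged, old value returned
```

Both verbs return the prior value, which is what makes them usable for building remote counters and locks without the previous best-effort shims.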

Controller instance lifecycle and scheduling:

  • Implemented dynamic GPU/instance management in the controller, including:

    • Per-GPU scheduling, cooldown/reuse logic, and idle pool management.
    • Methods to create and reclaim encoder/transformer/decoder service instances as subprocesses, with robust port/state checks and error handling.
    • Support for launching subprocesses with correct CUDA device visibility and configuration.
  • Added helper methods for mapping between instance addresses and monitor nodes, and for recursively converting configuration objects to plain Python types.
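The recursive conversion helper could look roughly like this — a minimal sketch, where `to_plain` is a hypothetical name and the real controller method may handle additional node types:

```python
from collections.abc import Mapping

def to_plain(obj):
    """Recursively convert a configuration object into plain Python types
    (dict / list / scalar) so it can be serialized or handed to a subprocess.
    Hypothetical sketch of the helper described above."""
    if isinstance(obj, Mapping):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    if hasattr(obj, "__dict__") and not isinstance(obj, type):
        # attribute-style config node: convert its fields
        return {key: to_plain(value) for key, value in vars(obj).items()}
    return obj  # already a plain scalar (int, str, None, ...)
```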

Monitoring and metrics extensibility:

  • Added support for registering an extra metrics provider in the Monitor class, allowing injection of custom metrics into the monitoring output in a thread-safe manner.
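A minimal sketch of the hook: the provider is swapped under a lock, and a failing provider never breaks the built-in metrics. The real Monitor class does much more, and the method names here are assumptions:

```python
import threading

class Monitor:
    """Minimal sketch of the extra-metrics hook described above."""

    def __init__(self):
        self._provider_lock = threading.Lock()
        self._extra_metrics_provider = None

    def register_extra_metrics_provider(self, provider):
        """provider: zero-argument callable returning a dict of metrics."""
        with self._provider_lock:
            self._extra_metrics_provider = provider

    def collect(self) -> dict:
        metrics = {"gpu_utilization": 0.0}  # built-in metrics would go here
        with self._provider_lock:
            provider = self._extra_metrics_provider
        if provider is not None:
            try:
                metrics.update(provider())  # inject the custom metrics
            except Exception:
                pass  # a misbehaving provider must not break monitoring
        return metrics
```

A service would call something like `monitor.register_extra_metrics_provider(lambda: {"queue_total_pending": pending})` to surface its queue depth.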

Configuration and CLI improvements:

  • Added a ranks field to the disaggregation config for explicit rank/GPU count configuration.
  • Introduced an --engine_rank command-line argument to override engine rank for service roles, and logic to apply this override in run_service.py.
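The override flow can be sketched as follows. The `--engine_rank` flag name is from this PR; the config shape and everything else is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--engine_rank", type=int, default=None,
                    help="Override the engine rank for this service role")

# Simulate invoking run_service.py with an explicit override.
args = parser.parse_args(["--engine_rank", "2"])

config = {"engine_rank": 0}       # value loaded from the disaggregation config
if args.engine_rank is not None:  # the CLI argument wins over the config file
    config["engine_rank"] = args.engine_rank

print(config["engine_rank"])  # → 2
```

Defaulting the flag to `None` (rather than 0) is what lets the code distinguish "not passed" from "explicitly rank 0".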

Other improvements:

  • Added missing imports and typing annotations in the controller for robustness.

These changes collectively enable more robust, flexible, and scalable operation of the disaggregated video service platform.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces automated instance management and auto-scaling to the disaggregated service architecture. Key changes include a new controller for dynamic lifecycle management of service instances, hardware-supported RDMA atomic operations, and enhanced monitoring with internal queue metrics. Feedback identifies a potential deadlock in the controller's locking strategy, recommending reentrant locks, and points out an indexing mismatch in output file naming that could break test scripts. Suggestions were also made to improve subprocess logging and make backpressure thresholds configurable.

self._rdma_handshake_thread_request: Thread | None = None
self._rdma_handshake_thread_phase1: Thread | None = None
self._rdma_handshake_thread_phase2: Thread | None = None
self._instance_lock = Lock()


Severity: high

The _instance_lock must be a reentrant lock (RLock). The _monitor_callback acquires this lock and then calls reclaim_instance or create_instance, both of which attempt to acquire the same lock. Using a non-reentrant Lock will cause a deadlock in the monitor thread.

Suggested change:
- self._instance_lock = Lock()
+ self._instance_lock = RLock()
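The deadlock scenario is easy to reproduce in miniature: with a plain `Lock`, the nested acquire below would block forever, while `RLock` lets the owning thread re-enter. The names mirror the review comment, but the classes are illustrative:

```python
import threading

class Controller:
    def __init__(self):
        self._instance_lock = threading.RLock()  # reentrant: same thread may re-acquire
        self.reclaimed = []

    def _monitor_callback(self, instance_address):
        with self._instance_lock:                # first acquisition
            self.reclaim_instance(instance_address)

    def reclaim_instance(self, instance_address):
        with self._instance_lock:                # nested acquisition by the same thread
            self.reclaimed.append(instance_address)

controller = Controller()
controller._monitor_callback("10.0.0.1:9000")   # completes instead of deadlocking
print(controller.reclaimed)  # → ['10.0.0.1:9000']
```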

Comment on lines +497 to +519
with self._instance_lock:
    service_instance_count = sum(1 for meta in self._managed_instances.values() if meta.get("instance_type") == service_type)

queues_empty_for_service = bool(low_metric.get("all_queues_empty", False)) and int(low_metric.get("queue_total_pending", -1)) == 0

if low_utilization < scale_in_threshold and service_instance_count > 1 and queues_empty_for_service and now - last_scale_ts[service_type] >= scale_cooldown_seconds:
    try:
        target_instance_address = self._instance_address_from_monitor_node(low_monitor_address)
        self.reclaim_instance(service_type, target_instance_address)
        last_scale_ts[service_type] = now
        self.logger.info(
            "Auto-scale in triggered: service=%s low_gpu_utilization=%.2f reclaimed_instance=%s",
            service_type,
            low_utilization,
            target_instance_address,
        )
    except Exception as exc:
        self.logger.warning(
            "Auto-scale in skipped for service=%s low_gpu_utilization=%.2f reason=%s",
            service_type,
            low_utilization,
            exc,
        )


Severity: high

The check for service_instance_count > 1 and the call to reclaim_instance should be atomic to prevent race conditions. With _instance_lock changed to an RLock, the entire scale-in logic should be moved inside the lock block.

                with self._instance_lock:
                    service_instance_count = sum(1 for meta in self._managed_instances.values() if meta.get("instance_type") == service_type)
                    queues_empty_for_service = bool(low_metric.get("all_queues_empty", False)) and int(low_metric.get("queue_total_pending", -1)) == 0

                    if low_utilization < scale_in_threshold and service_instance_count > 1 and queues_empty_for_service and now - last_scale_ts[service_type] >= scale_cooldown_seconds:
                        try:
                            target_instance_address = self._instance_address_from_monitor_node(low_monitor_address)
                            self.reclaim_instance(service_type, target_instance_address)
                            last_scale_ts[service_type] = now
                            self.logger.info(
                                "Auto-scale in triggered: service=%s low_gpu_utilization=%.2f reclaimed_instance=%s",
                                service_type,
                                low_utilization,
                                target_instance_address,
                            )
                        except Exception as exc:
                            self.logger.warning(
                                "Auto-scale in skipped for service=%s low_gpu_utilization=%.2f reason=%s",
                                service_type,
                                low_utilization,
                                exc,
                            )

if base_save_path:
    save_path = Path(base_save_path)
-   request_config["save_path"] = str(save_path.with_name(f"{save_path.stem}{i + 1}{save_path.suffix}"))
+   request_config["save_path"] = str(save_path.with_name(f"{save_path.stem}{i}{save_path.suffix}"))


Severity: high

There is an indexing mismatch for the output video files. The controller now uses 0-indexing (i), but the test script scripts/disagg/run_wan_t2v_service.sh still expects 1-indexed files. This will cause the test script to wait indefinitely for the last file.

Suggested change:
- request_config["save_path"] = str(save_path.with_name(f"{save_path.stem}{i}{save_path.suffix}"))
+ request_config["save_path"] = str(save_path.with_name(f"{save_path.stem}{i + 1}{save_path.suffix}"))

import time
from collections.abc import Mapping
from pathlib import Path
from threading import Event, Lock, Thread


Severity: medium

Import RLock to support reentrant locking, which is necessary for the instance management logic in the monitor callback.

Suggested change:
- from threading import Event, Lock, Thread
+ from threading import Event, Lock, RLock, Thread

]
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
process = subprocess.Popen(cmd, env=env)


Severity: medium

Subprocesses are launched without redirecting their output. This causes logs from all service instances to be interleaved in the controller's output, making it difficult to monitor or debug individual instances. Consider redirecting stdout and stderr to instance-specific log files.
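One way to implement this suggestion, sketched under the assumption that the launch path looks roughly like the excerpt above — `launch_instance`, the log directory, and the file naming are all hypothetical:

```python
import os
import subprocess
from pathlib import Path

def launch_instance(cmd, gpu_id, log_dir="logs"):
    """Launch a service instance pinned to one GPU, redirecting its output
    to an instance-specific log file instead of the controller's stdout."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    log_file = open(Path(log_dir) / f"instance_gpu{gpu_id}.log", "ab")
    # stderr is folded into stdout so each instance has a single ordered log
    process = subprocess.Popen(cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT)
    return process, log_file  # keep the handle open for the process lifetime
```

The controller would close the returned file handle when it reclaims the instance.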

self.logger.exception("Failed to connect phase1 request RDMA buffer, will retry")

- if self._phase1_rdma_buffer is not None:
+ if self._phase1_rdma_buffer is not None and len(req_queue) + len(waiting_queue) < 2:


Severity: medium

The backpressure limit 2 is hardcoded. Consider making this threshold configurable to allow tuning based on GPU capacity and workload requirements.
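A minimal sketch of the suggested change — the threshold read from config with the current value as the default. The key name `phase1_backpressure_limit` is hypothetical:

```python
def can_accept_phase1(req_queue, waiting_queue, config) -> bool:
    """Backpressure check with a configurable threshold. Defaulting to 2
    preserves the current hardcoded behaviour."""
    limit = int(config.get("phase1_backpressure_limit", 2))
    return len(req_queue) + len(waiting_queue) < limit

assert can_accept_phase1([], ["req"], {}) is True        # 1 < 2 (default)
assert can_accept_phase1(["req"], ["req"], {}) is False  # 2 < 2 fails
assert can_accept_phase1(["req"], ["req"], {"phase1_backpressure_limit": 4}) is True
```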

@helloyongyang merged commit da8ea2f into ModelTC:main on Mar 31, 2026
1 check passed
