Conversation
Nice idea — pre-warming the shader cache so every subsequent SimulationApp in the CI run starts fast. A few observations:
Overall
- The approach is sound: one throwaway `SimulationApp` launch compiles all RTX shaders into the on-disk cache, so the 4-5 test steps that follow skip the compilation tax. Clean and self-documenting.
scripts/warm_shader_cache.py
- `argparse.Namespace` construction — Manually assembling a `Namespace(headless=True, enable_cameras=True, visualizer=[])` is fragile; if `AppLauncher` ever adds a required arg or changes its defaults, this will silently break. Consider using `AppLauncher.add_app_launcher_args()` to build the parser, then parse known args, or at minimum document why the hand-rolled namespace is sufficient.
- `enable_cameras=True` — Good call: this forces the rendering/RTX path so the shader compilation actually happens. Worth a short inline comment explaining why cameras are enabled (it's not obvious since this script doesn't render anything).
- Error handling — If `AppLauncher` or `app.close()` throws (e.g., no GPU, missing Kit runtime), the CI step will still fail with a non-zero exit, which is fine. But a `try/except` with a descriptive message could save debugging time on new runner configurations.
- Minor: The `visualizer` kwarg is passed as `[]` — is this needed? If `AppLauncher` defaults `visualizer` to `None` or `[]` already, this can be dropped for clarity.
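The error-handling point could look like the following minimal sketch. Note that `launch_or_die` and the message wording are hypothetical, not part of the actual script; `launch_fn` stands in for the `AppLauncher` call:

```python
import sys


def launch_or_die(launch_fn, *args, **kwargs):
    """Call a launcher factory, exiting with a descriptive message on failure.

    Hypothetical helper: launch_fn stands in for AppLauncher, which can
    raise when there is no GPU or the Kit runtime is missing.
    """
    try:
        return launch_fn(*args, **kwargs)
    except Exception as exc:
        print(
            f"[warm_shader_cache] launcher failed: {exc!r}. "
            "Check GPU availability and the Kit runtime on this runner.",
            file=sys.stderr,
        )
        raise SystemExit(1)
```

The payoff is a one-line, runner-specific diagnosis in the CI log instead of a bare Kit traceback.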
.github/workflows/ci.yml
- The warm step is correctly placed before the first test step in both `test` and `test_policy` jobs. Since the two jobs run in separate containers (different images), each needs its own warm-up — good.
- The warm step doesn't set a `timeout-minutes` at the step level. Given shader compilation can take 60-180 s (per the docstring), consider adding a step-level timeout (~5-10 min) as a safety net against hangs.
No blocking issues — this is a clean, well-scoped CI optimization. 👍
```python
def main():
    t0 = time.monotonic()
    args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
    launcher = AppLauncher(args)
```
Nit: Hand-rolling `argparse.Namespace` couples this to `AppLauncher`'s internal expectations. If the launcher ever adds validation or required fields, this will break without a clear error.
Consider:

```python
parser = argparse.ArgumentParser()
AppLauncher.add_app_launcher_args(parser)
args = parser.parse_args(["--headless", "--enable_cameras"])
```

This also self-documents the available flags.
```python
def main():
    t0 = time.monotonic()
    args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
    launcher = AppLauncher(args)
```
Worth a brief comment explaining why `enable_cameras=True` — the point is to trigger the RTX/rendering shader compilation path, not to actually use cameras. Future readers might otherwise wonder.
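A minimal version of the suggested inline comment (wording is illustrative), mirroring the snippet above:

```python
import argparse

# enable_cameras=True is deliberate even though this script renders nothing:
# it forces the RTX/rendering path so shader compilation actually happens.
args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
```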
```yaml
# To restore sanity here is a goal for Isaac Lab Arena v0.3.

- name: Warm Kit shader cache
```
Consider adding a step-level `timeout-minutes: 10` here. The docstring says shader compilation can take 60-180 s, but if Kit hangs (no GPU, license issues, etc.) you'd want the step to fail fast rather than burn the full job timeout.
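Sketched against the step above (the `run` command is an assumption about how the script is invoked, not taken from the actual workflow):

```yaml
- name: Warm Kit shader cache
  timeout-minutes: 10  # fail fast if Kit hangs (no GPU, license issues, ...)
  run: python scripts/warm_shader_cache.py
```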
## Summary

Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems

From Lab:

1. Lab reports "AppLauncher doesn't quit properly after app.close(), app.quit() doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

The above issues apply to us, because tests hang during the sub-process tests section, between the end of the last test and the beginning of the next test. See detailed logs and analysis from reproducing locally [here](#568).

### Fixes

1. `SimulationApp` force exit: skip `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Call a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources.
2. `run_subprocess` gets configurable wall-clock timeouts and process isolation, so that, when needed, it can trigger the force-exit path above.
3. Add wall-clock timing and logging inside the `SimulationApp` start method, to keep track of how long startup takes on CI.

## Minor fixes

1. Add timing stats to the pytest commands so each section reports its slowest test functions at the end.
2. Parametrize multi-config tests: convert the nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail, and timing.
3. Reduce `num_envs` in the gr00t eval_runner test to speed it up.

### Local validation

With the repro script #568, I do not see local stalling. See the log for details:
[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)

### CI Before — timeout

<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" />

### CI After

<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" />
<img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" />

## TODOs

- test_camera_observations takes 10 min to start the app due to Kit cold start. Experimenting with a warm start before the test process here: #565
- Kit itself intermittently deadlocks during startup — not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with retry here: #570
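The force-exit helper in fix 1 could be sketched roughly as below. This is a simplified stand-in for `_kill_child_processes()`, Linux-only and assuming the standard `/proc/<pid>/stat` layout; the real helper may differ in details:

```python
import os
import signal


def kill_child_processes():
    """SIGKILL all direct children of this process by scanning /proc.

    Sketch of the orphan-cleanup idea: run before os._exit(0) so no Kit
    child processes survive to hold GPU resources. Linux-only; relies on
    the standard /proc/<pid>/stat layout.
    """
    me = os.getpid()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            # /proc/<pid>/stat looks like "pid (comm) state ppid ...".
            # The comm field may contain spaces and parens, so split
            # after the last closing paren; ppid is then field 2.
            ppid = int(stat.rsplit(")", 1)[1].split()[1])
            if ppid == me:
                os.kill(int(entry), signal.SIGKILL)
        except (OSError, ValueError, IndexError):
            continue  # process already exited or stat was unreadable
```

Walking `/proc` directly avoids a third-party dependency like psutil, which matters in a stripped-down CI container.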