[CI] Warm start CI before tests by xyao-nv · Pull Request #565 · isaac-sim/IsaacLab-Arena

xyao-nv · 2026-04-09T23:14:39Z

Summary

Short description of the change (max 50 chars)

Detailed description

What was the reason for the change?
What has been changed?
What is the impact of this change?

isaaclab-review-bot

Nice idea — pre-warming the shader cache so every subsequent SimulationApp in the CI run starts fast. A few observations:

Overall

The approach is sound: one throwaway SimulationApp launch compiles all RTX shaders into the on-disk cache, so the 4-5 test steps that follow skip the compilation tax. Clean and self-documenting.

scripts/warm_shader_cache.py

argparse.Namespace construction — Manually assembling a Namespace(headless=True, enable_cameras=True, visualizer=[]) is fragile; if AppLauncher ever adds a required arg or changes its defaults, this will silently break. Consider using AppLauncher.add_app_launcher_args() to build the parser, then parse known args, or at minimum document why the hand-rolled namespace is sufficient.
enable_cameras=True — Good call: this forces the rendering/RTX path so the shader compilation actually happens. Worth a short inline comment explaining why cameras are enabled (it's not obvious since this script doesn't render anything).
Error handling — If AppLauncher or app.close() throws (e.g., no GPU, missing Kit runtime), the CI step will still fail with a non-zero exit, which is fine. But a try/except with a descriptive message could save debugging time on new runner configurations.
Minor: The visualizer kwarg is passed as [] — is this needed? If AppLauncher defaults visualizer to None or [] already, this can be dropped for clarity.

.github/workflows/ci.yml

The warm step is correctly placed before the first test step in both test and test_policy jobs. Since the two jobs run in separate containers (different images), each needs its own warm-up — good.
The warm step doesn't set a timeout-minutes at the step level. Given shader compilation can take 60-180s (per the docstring), consider adding a step-level timeout (~5-10 min) as a safety net against hangs.

No blocking issues — this is a clean, well-scoped CI optimization. 👍

isaaclab-review-bot · 2026-04-09T23:50:21Z

+def main():
+    t0 = time.monotonic()
+    args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
+    launcher = AppLauncher(args)


Nit: Hand-rolling argparse.Namespace couples this to AppLauncher's internal expectations. If the launcher ever adds validation or required fields, this will break without a clear error.

Consider:

parser = argparse.ArgumentParser() AppLauncher.add_app_launcher_args(parser) args = parser.parse_args(["--headless", "--enable_cameras"])

This also self-documents the available flags.

isaaclab-review-bot · 2026-04-09T23:50:21Z

+def main():
+    t0 = time.monotonic()
+    args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
+    launcher = AppLauncher(args)


Worth a brief comment explaining why enable_cameras=True — the point is to trigger the RTX/rendering shader compilation path, not to actually use cameras. Future readers might otherwise wonder.

isaaclab-review-bot · 2026-04-09T23:50:21Z

      #
      # To restore sanity here is a goal for Isaac Lab Arena v0.3.
+
+      - name: Warm Kit shader cache


Consider adding a step-level timeout-minutes: 10 here. The docstring says shader compilation can take 60-180s, but if Kit hangs (no GPU, license issues, etc.) you'd want the step to fail fast rather than burn the full job timeout.

## Summary Subprocess-spawning tests hang indefinitely on CI. ## Causes & Fixes ### Problems From Lab: 1. Lab reports "AppLauncher doesnt quit properly after app.close(), app.quit() doesn't help either." 2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines. Above issues apply to us, because tests hang during sub-process tests section, between the end of last test and the beginning of the next test. See detailed logs and analysis from reproducing locally [here](#568) ### Fixes 1. `SimulationApp` Force Exit: Skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources. 2. `run_subprocess` has a configuarable wall-clock timeouts and process isolation, such that when needed, it could trigger the force exit path above. 3. Add wall-clock timing and logging inside the SimulationApp start method. Keep track of how much startup time is taking on CI. ## Minor fixes 1. Add timing stats into pytest cmds such that it reports the slowests test func at the end of each section. 2. Parametrize multi-config tests: Convert nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize.` Each config gets its own timeout, pass/fail, and timing. 3. Reduce num_envs in gr00t eval_runner test to speed up. ### Local validation With the repro script #568, I do not have local stalling. Log for more details. [repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log) ### CI Before -- timeout <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" /> ### CI After <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" /> <img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" /> ## TODOs - test_camera_observations takes 10mins to start the app due to Kit cold start. Experimenting with a warm start before tests process here #565 - Kit itself intermittently deadlocks during startup — not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with retry here #570

xyao-nv added 2 commits April 9, 2026 16:12

warm start

4e473e9

Update warm_shader_cache.py

577b001

isaaclab-review-bot Bot reviewed Apr 9, 2026

View reviewed changes

xyao-nv mentioned this pull request Apr 10, 2026

[CI] Fix CI subprocess test hangs #557

Merged

xyao-nv changed the title ~~Exp: warm start CI before tests~~ [CI] Warm start CI before tests Apr 10, 2026

Merge branch 'main' into xyao/exp/ci_warm_start

d409913

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Warm start CI before tests#565

[CI] Warm start CI before tests#565
xyao-nv wants to merge 3 commits intomainfrom
xyao/exp/ci_warm_start

xyao-nv commented Apr 9, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xyao-nv commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Detailed description

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xyao-nv commented Apr 9, 2026 •

edited

Loading