Skip to content

[CI] Warm start CI before tests#565

Draft
xyao-nv wants to merge 3 commits intomainfrom
xyao/exp/ci_warm_start
Draft

[CI] Warm start CI before tests#565
xyao-nv wants to merge 3 commits intomainfrom
xyao/exp/ci_warm_start

Conversation

@xyao-nv
Copy link
Copy Markdown
Collaborator

@xyao-nv xyao-nv commented Apr 9, 2026

Summary

Short description of the change (max 50 chars)

Detailed description

  • What was the reason for the change?
  • What has been changed?
  • What is the impact of this change?
image

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea — pre-warming the shader cache so every subsequent SimulationApp in the CI run starts fast. A few observations:

Overall

  • The approach is sound: one throwaway SimulationApp launch compiles all RTX shaders into the on-disk cache, so the 4-5 test steps that follow skip the compilation tax. Clean and self-documenting.

scripts/warm_shader_cache.py

  1. argparse.Namespace construction — Manually assembling a Namespace(headless=True, enable_cameras=True, visualizer=[]) is fragile; if AppLauncher ever adds a required arg or changes its defaults, this will silently break. Consider using AppLauncher.add_app_launcher_args() to build the parser, then parse known args, or at minimum document why the hand-rolled namespace is sufficient.
  2. enable_cameras=True — Good call: this forces the rendering/RTX path so the shader compilation actually happens. Worth a short inline comment explaining why cameras are enabled (it's not obvious since this script doesn't render anything).
  3. Error handling — If AppLauncher or app.close() throws (e.g., no GPU, missing Kit runtime), the CI step will still fail with a non-zero exit, which is fine. But a try/except with a descriptive message could save debugging time on new runner configurations.
  4. Minor: The visualizer kwarg is passed as [] — is this needed? If AppLauncher defaults visualizer to None or [] already, this can be dropped for clarity.

.github/workflows/ci.yml

  • The warm step is correctly placed before the first test step in both test and test_policy jobs. Since the two jobs run in separate containers (different images), each needs its own warm-up — good.
  • The warm step doesn't set a timeout-minutes at the step level. Given shader compilation can take 60-180s (per the docstring), consider adding a step-level timeout (~5-10 min) as a safety net against hangs.

No blocking issues — this is a clean, well-scoped CI optimization. 👍

def main():
t0 = time.monotonic()
args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
launcher = AppLauncher(args)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Hand-rolling argparse.Namespace couples this to AppLauncher's internal expectations. If the launcher ever adds validation or required fields, this will break without a clear error.

Consider:

parser = argparse.ArgumentParser()
AppLauncher.add_app_launcher_args(parser)
args = parser.parse_args(["--headless", "--enable_cameras"])

This also self-documents the available flags.

def main():
t0 = time.monotonic()
args = argparse.Namespace(headless=True, enable_cameras=True, visualizer=[])
launcher = AppLauncher(args)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a brief comment explaining why enable_cameras=True — the point is to trigger the RTX/rendering shader compilation path, not to actually use cameras. Future readers might otherwise wonder.

Comment thread .github/workflows/ci.yml
#
# To restore sanity here is a goal for Isaac Lab Arena v0.3.

- name: Warm Kit shader cache
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a step-level timeout-minutes: 10 here. The docstring says shader compilation can take 60-180s, but if Kit hangs (no GPU, license issues, etc.) you'd want the step to fail fast rather than burn the full job timeout.

@xyao-nv xyao-nv changed the title Exp: warm start CI before tests [CI] Warm start CI before tests Apr 10, 2026
xyao-nv added a commit that referenced this pull request Apr 10, 2026
## Summary
Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems

From Lab:

1. Lab reports "AppLauncher doesnt quit properly after app.close(),
app.quit() doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab
CI machines.

Above issues apply to us, because tests hang during sub-process tests
section, between the end of last test and the beginning of the next
test. See detailed logs and analysis from reproducing locally
[here](#568)

### Fixes

1. `SimulationApp` Force Exit: Skips `app.close()` (which can hang
indefinitely in Kit's shutdown path) when the env var
`ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set.
Calls a new `_kill_child_processes()` helper that walks `/proc` to
`SIGKILL` all direct children before doing `os._exit(0)`, preventing
orphaned Kit processes from holding GPU resources.

2. `run_subprocess` has a configuarable wall-clock timeouts and process
isolation, such that when needed, it could trigger the force exit path
above.

3. Add wall-clock timing and logging inside the SimulationApp start
method. Keep track of how much startup time is taking on CI.

## Minor fixes

1. Add timing stats into pytest cmds such that it reports the slowests
test func at the end of each section.

2. Parametrize multi-config tests: Convert nested for-loops in
`test_zero_action_policy_kitchen_pick_and_place` (6 configs) and
`test_zero_action_policy_gr1_open_microwave` (3 configs) into
`@pytest.mark.parametrize.` Each config gets its own timeout, pass/fail,
and timing.

3. Reduce num_envs in gr00t eval_runner test to speed up.

### Local validation
With the repro script
#568, I do not have
local stalling.
Log for more details.

[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)


### CI Before -- timeout
<img width="1219" height="170" alt="image"
src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00"
/>

### CI After
<img width="1219" height="170" alt="image"
src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d"
/>
<img width="1198" height="472" alt="image"
src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2"
/>

## TODOs
- test_camera_observations takes 10mins to start the app due to Kit cold
start. Experimenting with a warm start before tests process here
#565
- Kit itself intermittently deadlocks during startup — not because of
orphans, but because Kit's internal thread synchronization fails on
low-CPU runners. Experimenting with retry here
#570
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant