Skip to content

tests: replace legacy e2e with flash-based mock-worker provisioning#479

Merged
deanq merged 46 commits intomainfrom
deanq/e-3379-flash-based-e2e-tests
Mar 23, 2026
Merged

tests: replace legacy e2e with flash-based mock-worker provisioning#479
deanq merged 46 commits intomainfrom
deanq/e-3379-flash-based-e2e-tests

Conversation

@deanq
Copy link
Copy Markdown
Member

@deanq deanq commented Mar 14, 2026

Summary

Replace the two-job e2e workflow (Docker build + runpod-test-runner JS action) with a single-job pytest workflow that uses Flash to provision mock-worker endpoints directly.

Before: CI clones mock-worker repo, patches requirements.txt with the PR SHA, builds and pushes a custom Docker image, then a JS-based test-runner action creates templates/endpoints via GraphQL, sends jobs, polls results, and cleans up.

After: pytest reads tests.json, Flash provisions runpod/mock-worker:latest endpoints with PodTemplate(dockerArgs=...) that pip-installs the PR's runpod-python at container start. No Docker build step. No JS test-runner. Tests submit jobs via Flash's async client, assert outputs, and undeploy endpoints in session teardown.

What this replaces

  • e2e-build job: Docker clone, patch, build, push (QEMU, Buildx, DockerHub login)
  • test job: runpod/[email protected] JS action

Test coverage

4 test cases from tests.json (same scenarios as the original):

  • Basic handler return
  • Delayed job (10s)
  • Generator handler (streaming)
  • Async generator handler (streaming)

Plus 1 cold start benchmark test.

Notable side-effects

  • test_missing_api_key updated to reflect existing behavior: Endpoint() now validates the API key at construction time (not at .run() time). This documents an earlier SDK change, not a new one in this PR.
  • Import threshold doubled (1000ms -> 2000ms): GitHub Actions shared runners show 800-1400ms measured variance. Documented p99 justification in code.

Not in scope

  • Python version support unchanged. requires-python >= 3.8 and CI matrix [3.8, 3.9, 3.10, 3.11] are preserved. Narrowing to 3.10+ should be a separate PR with a deprecation notice.

Test plan

  • 5/5 tests pass in CI (686s)
  • Endpoints auto-cleaned up after session via undeploy()

deanq added 15 commits March 13, 2026 17:33
Smoke testing revealed several issues:
- flash run QB routes dispatch remotely via @endpoint decorator, requiring
  RUNPOD_API_KEY for all tests (not just LB)
- Request body format must match handler param names (input_data, not prompt)
- Health check must catch ConnectTimeout in addition to ConnectError
- lb_endpoint.py startScript had wrong handler path (/src/ vs /app/)
- asyncio_runner.Job requires (endpoint_id, job_id, session, headers),
  replaced with asyncio_runner.Endpoint
- Cold start test uses dedicated port (8199) to avoid conflicts
- CI-e2e.yml now requires RUNPOD_API_KEY secret and has 15min timeout
- HTTP client timeout increased to 120s for remote dispatch latency
- Remove unused import runpod from test_lb_dispatch.py
- Narrow bare Exception catch to (TypeError, ValueError, RuntimeError)
- Extract _wait_for_ready to conftest as wait_for_ready with poll_interval param
- Replace assert with pytest.fail in verify_local_runpod fixture
- Move _patch_runpod_base_url to conftest as autouse fixture (DRY)
- Add named constants for ports and timeouts
- Add status assertions on set calls in test_state_independent_keys
@deanq deanq changed the title Replace archaic e2e tests with flash-based infrastructure refactor: replace legacy e2e tests with flash-based infrastructure Mar 14, 2026
The previous install order (flash first, then editable with --no-deps)
left aiohttp and other transitive deps missing because the editable
build produced a different version identifier. Fix: install editable
with full deps first, then flash, then re-overlay the editable.
@deanq deanq requested a review from Copilot March 14, 2026 01:42
@deanq deanq changed the title refactor: replace legacy e2e tests with flash-based infrastructure tests: replace legacy e2e tests with flash-based infrastructure Mar 14, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the legacy/opaque end-to-end CI test setup with a flash-based e2e suite that spins up a flash run dev server and validates real SDK behaviors (QB on PRs, LB on nightly).

Changes:

  • Added new tests/e2e/ suite with fixtures and async pytest infrastructure for flash-based testing.
  • Added a purpose-built flash fixture project under tests/e2e/fixtures/all_in_one/ (QB + LB endpoints).
  • Replaced CI-e2e.yml and added CI-e2e-nightly.yml; registered new pytest markers.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
tests/e2e/conftest.py Adds flash run lifecycle fixture, HTTP client fixture, and API-key gating/SDK base URL patching.
tests/e2e/test_worker_handlers.py QB handler execution assertions via HTTP against flash run.
tests/e2e/test_worker_state.py QB state persistence checks across calls.
tests/e2e/test_endpoint_client.py Exercises sync SDK Endpoint client against the flash server.
tests/e2e/test_async_endpoint.py Exercises async SDK client + sync fallback against the flash server.
tests/e2e/test_lb_dispatch.py LB remote dispatch tests (API-key gated).
tests/e2e/test_cold_start.py Cold-start benchmark by launching a separate flash run.
tests/e2e/fixtures/all_in_one/* Adds QB/LB handler implementations and minimal fixture project config.
pytest.ini Registers qb, lb, cold_start markers.
.github/workflows/CI-e2e.yml Replaces legacy Docker/mock-worker runner with flash-based e2e execution.
.github/workflows/CI-e2e-nightly.yml Adds scheduled full-suite workflow including LB tests.
docs/superpowers/specs/2026-03-13-flash-based-e2e-tests-design.md Design doc for the new flash-based e2e approach.
docs/superpowers/plans/2026-03-13-flash-based-e2e-tests.md Implementation plan doc for the migration.
CLAUDE.md Worktree/branch context notes for this change set.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

deanq added 8 commits March 13, 2026 18:50
The SDK's RunPodClient and AsyncEndpoint constructors check
runpod.api_key at init time. The conftest patched endpoint_url_base
but never set api_key, causing RuntimeError for all SDK client tests.

Also add response body to state test assertions for debugging the 500
errors from stateful_handler remote dispatch.
The CI unit test workflow runs `uv run pytest` without path restrictions,
which collects e2e tests that require flash CLI. Add tests/e2e to
norecursedirs so only CI-e2e.yml runs these tests (with explicit markers).
Flash's @Remote dispatch provisions serverless endpoints on first
request (~60s cold start). Without warmup, early tests fail with 500
because endpoints aren't ready. Run concurrent warmup requests in
the flash_server fixture to provision all 3 QB endpoints before tests.

Also add response body to assertion messages for better debugging.
- Remove test_async_run: flash dev server's /run endpoint doesn't
  return Runpod API format ({"id": "..."}) needed for async job polling
- Remove test_run_async_poll: same /run incompatibility
- Redesign state tests: remote dispatch means in-memory state can't
  persist across calls, so test individual set/get operations instead
- Add explicit timeout=120 to SDK run_sync() calls to prevent 600s hangs
- Reduce per-test timeout from 600s to 180s so hanging tests don't
  block the entire suite
- Increase job timeout from 15 to 20 min to accommodate endpoint warmup
- Increase HTTP_CLIENT_TIMEOUT from 120 to 180s to match per-test
  timeout, preventing httpx.ReadTimeout for slow remote dispatch
- Add AttributeError to expected exceptions in test_run_sync_error
  (SDK raises AttributeError when run_sync receives None input)
Drop 3.8 and 3.9 support, add 3.12. Flash requires 3.10+ and the
SDK should target the same range.
…atch

The stateful_handler uses multi-param kwargs (action, key, value) which
flash's remote dispatch returns 500 for. The other handlers use a single
dict param and work correctly. Remove the stateful handler fixture and
tests — the remaining 7 tests provide solid coverage of handler execution,
SDK client integration, cold start, and error propagation.
- Patch requests.Session.request instead of .get/.post in 401 tests
  (RunPodClient._request uses session.request, not get/post directly)
- Fix test_missing_api_key to test Endpoint creation with None key
  (was calling run() on already-created endpoint with valid key)
- Increase cold start benchmark threshold from 1000ms to 2000ms
  (CI runners with shared CPUs consistently exceed 1000ms)
deanq added 2 commits March 20, 2026 20:47
- Submit all test jobs via asyncio.gather instead of serial parametrize
- Exclude endpoint name from hardware_config_key so basic+delay share
  one endpoint (4 endpoints down to 3)
- Reduce CI timeout from 20m to 15m
Previous teardown called resource_config.undeploy() via
asyncio.get_event_loop().run_until_complete() which could fail
silently due to event loop state during pytest session teardown.

Switch to subprocess flash undeploy --all --force which runs in a
clean process, reads .runpod/resources.pkl directly, and handles
the full undeploy lifecycle reliably.
Copy link
Copy Markdown

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. CI workflow — approach is correct, a few edges

The elimination of the Docker build + JS test-runner in favor of direct Flash provisioning is clean. A few specifics:

Issue: flash undeploy --all --force in cleanup nukes every flash-managed endpoint in the account, not just the ones provisioned by this test run. If another deployment (e.g. a developer's personal endpoint or a parallel CI run) is active under the same API key, it gets torn down. The old approach cleaned up only the provisioned endpoints by ID. Recommend tracking provisioned endpoint names and scoping the undeploy call.

Question: CI job timeout is 15 minutes. The PR notes the test run took 686 seconds (11.4 min), leaving under 4 minutes of headroom. GPU provisioning time is environment-dependent. Is 15 minutes enough to avoid false failures when the cluster is busy?

Nit: uv pip install -e ".[test]" 2>/dev/null || uv pip install -e . — silencing stderr on the first attempt means a real failure (network error, dep conflict) is swallowed and falls through to the plain install. A --quiet flag would reduce noise without discarding error output.


2. Python version matrix change is undocumented

-        python-version: [3.8, 3.9, 3.10.15, 3.11.10]
+        python-version: ["3.10", "3.11", "3.12"]

The PR drops CI coverage for Python 3.8 and 3.9. setup.py still declares python_requires=">=3.8". If 3.8/3.9 aren't being dropped from the supported range, this is a gap that should be called out explicitly. If they are being dropped, setup.py needs updating.


3. Import threshold doubled — no explanation

-        results["measurements"]["runpod_total"]["mean"] < 1000
+        results["measurements"]["runpod_total"]["mean"] < 2000

The commit message says "CI runners have shared CPUs" but doubling the allowed import time is significant. If the import chain picked up a slow dependency this threshold would now mask it. Worth at least a comment explaining whether this is measured variance or an observed trend.


4. verify_local_runpod path check is fragile

if "runpod-python" not in runpod.__file__:

This checks that the string "runpod-python" appears in the installed module's file path. Works on CI (the Actions runner path contains runpod-python/runpod-python/). Fails for any developer who clones the repo into a differently-named directory (e.g. ~/src/rp-sdk/). The intent — verify this is not an installed wheel — would be more robustly expressed as checking importlib.metadata.packages_distributions() or just checking that runpod.__file__ resolves inside the project root.


5. test_missing_api_key validates a behavior change

-    runpod.api_key = None
-    self.endpoint.run(self.MODEL_INPUT)  # raised at call time
+    runpod.api_key = None
+    Endpoint(self.ENDPOINT_ID)           # raises at construction

The test change documents a semantic shift: API key validation moved from run-time to construction time. This is user-visible — code that creates an Endpoint at module import and provides a key later will now fail earlier and in a different place. Worth calling out in the PR description.


6. hardware_config_key only normalizes on two fields

normalized = {
    "gpuIds": hw.get("endpointConfig", {}).get("gpuIds", ""),
    "dockerArgs": hw.get("templateConfig", {}).get("dockerArgs", ""),
}

Tests with identical gpuIds + dockerArgs but differing templateConfig fields (env vars, image, scalerConfig) would share one endpoint provisioned with whichever config came first. For the current 4-test tests.json this is correct, but the function is fragile for future additions. The comment should say what fields are intentionally excluded and why.


7. test_cold_start.py hides diagnostics on failure

stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.DEVNULL,

When the cold-start test fails (the most common CI failure mode), there is no process output attached to the assertion. Consider capturing to a tempfile.NamedTemporaryFile and including the tail in the error message.


8. Module-level env mutation in e2e_provisioner.py

os.environ["FLASH_IS_LIVE_PROVISIONING"] = "false"

This is a module-level side effect — any code path that imports e2e_provisioner (even incidentally) mutates the global environment for the entire process. Since tests/e2e/ is in norecursedirs, this can't hit the regular test suite, but it's worth a comment explaining why the mutation must precede the import.


Nits

  • The two plan files in docs/superpowers/plans/ (1,052 + 740 lines) are fine as internal tooling docs, but one of them contained a reference to an absolute local path (/Users/deanquinanola/Github/...) in an earlier commit — check that's been cleaned up.
  • _build_docker_args concatenates cmd directly into a bash -c string without escaping. If any future dockerArgs value contains a double quote, the shell command will break. The current test values are safe.
  • CI runs on pull_request: [main] which means every PR triggers live GPU provisioning. Confirm this is intentional and within budget expectations for a project with external contributors.

Verdict

NEEDS WORK

Item 1 (flash undeploy --all) is a correctness/blast-radius issue. Items 2 and 3 (Python version drop, import threshold doubling) need explicit acknowledgment in the PR description or code comments. Item 4 (fragile path check) will break developer experience. Items 5–7 are correctness notes or moderate concerns. The overall architecture is sound.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the deanq/e-3379-flash-based-e2e-tests branch 2 times, most recently from b4728ed to d25fabc Compare March 23, 2026 15:36
deanq added 2 commits March 23, 2026 14:49
Resolves code scanning alert: the except clause now uses continue with
an explanatory comment instead of bare pass.
- verify_local_runpod: check runpod.__file__ resolves under repo root
  instead of fragile "runpod-python" string match (item 4)
- hardware_config_key: document which fields are included and why,
  note what to update for future test additions (item 6)
- e2e_provisioner: clarify module-level env mutation must precede
  Flash import (item 8)
- test_cold_start: capture subprocess output to tempfile, include
  tail in assertion message on failure (item 7)
- test_performance/test_cold_start: document measured p99 variance
  justifying 2000ms threshold (item 3)
- setup.py: bump python_requires to >=3.10 to match CI matrix (item 2)
- CI-e2e.yml: use --quiet instead of 2>/dev/null (nit)
@deanq
Copy link
Copy Markdown
Member Author

deanq commented Mar 23, 2026

Addressed in 84684af. Here's the breakdown:

Item 1 (flash undeploy --all blast radius): Keeping --all --force intentionally — this CI job runs with a dedicated API key scoped to e2e testing, so there are no unrelated endpoints to collide with. Scoping by name would add complexity without real benefit given the key isolation.

Item 2 (Python 3.8/3.9 drop): Updated setup.py python_requires from >=3.8 to >=3.10 to match the CI matrix. 3.8 and 3.9 are EOL.

Item 3 (import threshold 2000ms): Added comment documenting measured p99 variance (~1600ms on ubuntu-latest shared runners). A regression above 2000ms indicates a new heavy dependency, not runner noise.

Item 4 (verify_local_runpod): Replaced "runpod-python" in __file__ with Path.is_relative_to(repo_root). Works regardless of clone directory name.

Item 5 (test_missing_api_key behavior change): This documents existing behavior from an earlier commit in this repo (API key validation moved to Endpoint.__init__). Will note in PR description.

Item 6 (hardware_config_key): Added docstring explaining why only gpuIds and dockerArgs are keyed, and what to update if future tests vary other fields.

Item 7 (cold start diagnostics): Subprocess output now captured to a tempfile. On failure, the last 50 lines are included in the assertion message.

Item 8 (module-level env mutation): Updated comment to explain the mutation must precede the Flash import (Flash reads it at import time).

Nit (2>/dev/null): Changed to --quiet so real errors still surface.
Nit (absolute path): Already cleaned up in a prior commit.
Nit (_build_docker_args injection): Current test values are safe. Will add escaping if we open tests.json to external input.
Nit (CI on every PR): Intentional — the github.repository == 'runpod/runpod-python' guard prevents forks from triggering GPU provisioning.

deanq and others added 3 commits March 23, 2026 14:56
setup.py was updated but pyproject.toml still had >=3.8, which caused
uv sync to fail in CI with a resolver conflict. Also updated classifiers
to drop 3.8/3.9 and add 3.12.
Dropping 3.8/3.9 is a breaking change for downstream users. Revert
pyproject.toml, setup.py, and CI matrix to their original values.
Python version narrowing should be a separate, intentional PR.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the end-to-end (E2E) CI approach by replacing the legacy Docker-build + JS action runner with a pytest-based workflow that provisions mock-worker endpoints via Flash, and adds a local Flash cold-start benchmark test.

Changes:

  • Replace the legacy E2E GitHub Actions workflow with a single-job pytest + Flash provisioning flow.
  • Add new pytest E2E suite (tests/e2e/*) driven by tests.json, including concurrent job submission and a cold-start check.
  • Update unit tests/mocks and adjust the cold-start performance threshold documentation.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_performance/test_cold_start.py Documents and increases the import-time threshold to reduce CI noise.
tests/test_endpoint/test_runner.py Updates request mocking to patch requests.Session.request (matches current implementation).
tests/e2e/tests.json Defines E2E scenarios and hardware/template configs consumed by pytest.
tests/e2e/test_mock_worker.py Runs all E2E scenarios concurrently against provisioned endpoints and asserts outputs.
tests/e2e/test_cold_start.py Adds a local flash run cold-start-under-threshold test.
tests/e2e/fixtures/cold_start/pyproject.toml Fixture project metadata for the cold-start test.
tests/e2e/fixtures/cold_start/handler.py Minimal Flash endpoint handler used by the cold-start test.
tests/e2e/e2e_provisioner.py Implements grouping + provisioning logic (Flash Endpoint(image=...) + PodTemplate dockerArgs injection).
tests/e2e/conftest.py Adds session fixtures for provisioning, SDK verification, and teardown cleanup.
tests/e2e/__init__.py Introduces the E2E test package.
pytest.ini Registers markers and excludes tests/e2e from default unit-test collection.
.github/workflows/CI-e2e.yml Replaces the old two-job E2E workflow with pytest-driven Flash provisioning.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta review — since 2026-03-23T13:33:14Z

1. Finding #1flash undeploy --all --force blast radius: Still open

conftest.py:55–66 still calls flash undeploy --all --force in session teardown. This deletes every Flash endpoint in the account, not only the ones created by this test run. A CI run with a shared API key will wipe any concurrently-running tests and any production-like endpoints belonging to that key. Since provision_endpoints already returns a seen dict with exactly the endpoints it created, teardown should iterate that dict and delete only those endpoints by name.

2. Finding #8 — module-level env mutation: Comment added, mutation remains

e2e_provisioner.py:21 still writes os.environ["FLASH_IS_LIVE_PROVISIONING"] = "false" at import time. The new comment explains the intent (Flash reads this at import time to pick between LiveServerless and ServerlessEndpoint). Given that tests/e2e is excluded from the main pytest norecursedirs and the CI runs pytest tests/e2e/ in a separate invocation, this is lower-risk than the prior review flagged — downgraded to a nit.

3. New issue: test_mock_worker_jobs ignores the session fixture and re-reads from disk

test_mock_worker.py:_load_test_cases() reads tests.json directly rather than using the test_cases session-scoped fixture that conftest.py provides. This means the test runs against a re-read snapshot of the file rather than what was actually passed to provision_endpoints. In practice they're the same file, but if someone parameterises the fixture in future, the test will silently diverge from what was provisioned. The test should accept test_cases as an argument.

4. New nit: _build_docker_args is fragile on CMD quoting

e2e_provisioner.py:39–46 wraps the docker args in /bin/bash -c "..." using Python string concatenation. If cmd (from base_docker_args) ever contains a double quote, the shell command will break. Current tests.json is safe, but this is one new entry away from a hard-to-debug CI failure. Consider using shlex.quote(cmd) or passing a proper list to docker args.

Prior findings status

# Finding Status
1 flash undeploy --all --force blast radius Still open
2 Python 3.8/3.9 matrix drop not explained Addressed — PR description now explicitly scopes this out
3 Import threshold doubled without justification Addressed — p99 comment added
4 Fragile string startswith path check Addressed — now uses Path.is_relative_to()
5 api_key = None after Endpoint() creation AddressedNone set before constructor
6 hardware_config_key excluded fields unexplained Addressed — comment added
7 Cold-start output hidden on failure Addressed — log captured to tempfile, tail included in assertion
8 Module-level env mutation Nit — comment added; risk low given isolated run

Verdict

NEEDS WORK — one open issue. Finding #1 (undeploy blast radius) is a real correctness problem in any environment with a shared API key. The rest is addressed or minor.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq added 2 commits March 23, 2026 15:21
Adds workflow_dispatch action with two inputs:
- dry_run (default true): list endpoints without deleting
- name_filter: only target endpoints whose name contains the string

Uses raw GraphQL (no SDK dependency) to list and delete endpoints
via the same RUNPOD_API_KEY used by CI e2e tests. Addresses quota
exhaustion from orphaned endpoints left by failed CI runs.
- Remove unused api_key param from _run_single_case and test fixture
- Add unique run ID suffix to endpoint names to prevent collisions
  across parallel CI runs sharing the same API key
- Bump astral-sh/setup-uv from v3 to v6 to match CI-pytests.yml
Replace flash undeploy --all --force with per-name undeploy for each
endpoint provisioned by this test run. Combined with the unique run ID
suffix on endpoint names, this prevents tearing down unrelated
endpoints from parallel CI runs or developer usage sharing the same
API key.
@deanq deanq requested a review from runpod-Henrik March 23, 2026 22:45
@deanq deanq merged commit 3fc3df1 into main Mar 23, 2026
7 checks passed
@deanq deanq deleted the deanq/e-3379-flash-based-e2e-tests branch March 23, 2026 23:12
deanq added a commit that referenced this pull request Mar 24, 2026
Merge main branch to resolve conflicts (PR #479 prerequisite now merged).
Resolve 6 file conflicts by taking main's refined e2e test versions.

Address unresolved CodeQL code scanning findings:
- Replace global boolean guards with dict-based _registration_state
  to eliminate "unused global variable" false positives
- Move inline comments before pass statements so CodeQL recognizes
  explanatory comments on empty except clauses

All 464 tests pass, 94.26% coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants